Trino (SQL query engine): Difference between revisions

Revision as of 11:35, 24 September 2021 by Brianolsen2 (79 edits): Add Trino name explanation, add citations, and reword tradeoff for checkpoints
Latest revision as of 17:24, 27 December 2024 by LesterMartin (2 edits, minor): showing breadth of file formats, not just columnar ones
(24 intermediate revisions by 18 users not shown)
Line 1: {{Short description\|Open-source distributed SQL query engine}} {{Infobox software \| name = Trino Line 6 ⟶ 7: \| caption = Trino UI Version 358 \| author = Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang ~~\| released = {{Start date and age\|10 November 2013}}~~ \| programming language = [[Java (programming language)\|Java]] \| operating system = [[Cross-platform]] Line 16: }} '''Trino''' is an [[Open-source software\|open-source]] distributed [[SQL]] query engine designed to query large data sets distributed over one or more heterogeneous data sources.<ref>{{cite web \|title=Overview — Trino ~~361~~468 Documentation \|url=https://trino.io/docs/~~361~~468/overview.html \|website=trino.io \|access-date=~~20~~27 ~~September~~December ~~2021~~2024}}</ref> Trino ~~is commonly used as a~~can query ~~engine over~~ [[~~Data_lake\|datalakes]] and [[Data Warehouse\|~~data ~~warehouses~~lake]]s ~~using~~that ~~the~~contain ~~[[Apache~~a ~~Hive\|Hive]] and [[List~~variety of ~~Apache~~file ~~Software~~formats ~~Foundation~~such ~~projects#Active~~as ~~projects\|Iceberg]]<ref name="iceberg">{{cite web \|title=About~~simple row-oriented ~~Apache~~CSV ~~Iceberg~~and ~~\|url=http://iceberg.apache.org/~~JSON ~~\|website=iceberg.apache.org~~data ~~\|access-date=18~~files ~~September~~to ~~2021}}</ref>~~more ~~table formats. 
In these configurations Trino queries can query data in~~performant [[Free and open source software\|open]] [[Column-oriented DBMS\|column-oriented]] data file formats like [[Apache ORC\|ORC]] or [[Apache Parquet\|Parquet]]<ref name="hive-connector" /><ref name="iceberg-connector" /> residing on different storage systems like [[Apache Hadoop#Hadoop distributed file system\|HDFS]], [[Amazon S3\|AWS S3]], [[Google Cloud Storage]], or [[Microsoft Azure#Storage services\|Azure Blob Storage]]<ref name="trino-definitive-guide-ch1" /> using the [[Apache Hive\|Hive]]<ref name="hive-connector">{{cite web \|title=Hive connector — Trino 393 Documentation \|url=https://trino.io/docs/393/connector/hive.html \|website=trino.io}}</ref> and [[List of Apache Software Foundation projects#Active projects\|Iceberg]]<ref name="iceberg-connector">{{cite web \|title=Iceberg connector — Trino 393 Documentation \|url=https://trino.io/docs/393/connector/iceberg.html \|website=trino.io \|access-date=25 August 2022}}</ref> table formats. Trino also has the ability to run federated queries ~~across~~that ~~multiple~~query ~~disparate~~tables in different data sources such as [[MySQL]], [[PostgreSQL]], [[Apache Cassandra\|Cassandra]], [[Apache Kafka\|Kafka]], [[MongoDB]] and [[Elasticsearch]].<ref>{{cite web \|title=Connectors — Trino ~~is~~393 ~~community~~Documentation ~~driven~~\|url=https://trino.io/docs/393/connector.html ~~and~~\|website=trino.io \|access-date=25 August 2022}}</ref> Trino is released under the [[Apache License]].<ref>{{cite web \|title=trinodb/trino LICENSE \|url=https://github.com/trinodb/trino/blob/master/LICENSE \|publisher=Trino \|access-date=25 August 2022 \|date=25 August 2022}}</ref> == History == Trino was originally designed and developed by Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang at [[Facebook]] to allow data analysts to run interactive queries on its large [[data warehouse]] in [[Apache Hadoop]]. 
The project was originally named [[Presto (SQL query engine)\|Presto]] and shares the first six years of development with the Presto project.<ref>{{cite web \|title=Contributors to trinodb/trino \|url=https://github.com/trinodb/trino/graphs/contributors?from=2012-08-05&to=2018-08-05&type=c \|website=GitHub \|access-date=20 September 2021 \|language=en}}</ref><ref>{{cite web \|title=Contributors to prestodb/presto \|url=https://github.com/prestodb/presto/graphs/contributors?from=2012-08-05&to=2018-08-05&type=c \|website=GitHub \|access-date=20 September 2021 \|language=en}}</ref> Before Presto, data analysts at Facebook relied on [[Apache Hive]], which was too slow for running interactive SQL analytics on their 250-petabyte data warehouse.<ref name="2013facebook">{{Cite news\|url=http://www.computerworld.com/article/2485668/business-intelligence/facebook-goes-open-source-with-query-engine-for-big-data.html\|title=Facebook goes open source with query engine for big data\|author=Joab Jackson\|date=November 6, 2013\|work=Computer World\|access-date=April 26, 2017}}</ref> ▼ In January 2019, the ~~Trino~~original ~~Software~~creators ~~Foundation~~of [[Presto (~~formerly~~SQL query engine)\|Presto]], ~~Software~~Martin ~~Foundation~~Traverso, Dain Sundstrom, and David Phillips, created a [[Fork (software development)\|fork]] ~~was~~of the Presto project. They initially kept the name Presto and used the PrestoSQL web handle to distinguish it from the original PrestoDB project. Simultaneously, they announced the Presto Software Foundation. 
The foundation is a not-for-profit organization dedicated to the advancement of the ~~Trino~~Presto open source distributed SQL query engine.<ref name="2019psf">{{Cite web\|url=https://www.prweb.com/releases/~~presto_software_foundation_launches_to_advance_presto_open_source_community/prweb16070792~~presto-software-foundation-launches-to-advance-presto-open-source-community-815915772.~~htm~~html\|title=Presto Software Foundation Launches to Advance Presto Open Source Community\|website=PRWeb\|access-date=2019-02-01}}</ref><ref name="2019psf2">{{Cite web\|url=https://thenewstack.io/prestos-new-foundation-signals-growth-for-the-big-data-sql-engine/\|title=Presto's New Foundation Signals Growth for the Big Data SQL Engine\|date=2019-01-31\|website=The New Stack\|language=en-US\|access-date=2019-02-01}}</ref>▼ Martin, Dain, David, and Eric began development in 2012 and they deployed an initial version later that year. Later, Facebook announced its release as open source late Fall of 2013.<ref name="2013facebook" /><ref name="2013facebook2">{{Cite news\|url=https://gigaom.com/2013/06/06/facebook-unveils-presto-engine-for-querying-250-pb-data-warehouse/\|title=Facebook unveils Presto engine for querying 250 PB data warehouse\|author=Jordan Novet\|date=June 6, 2013\|work=Giga Om\|access-date=April 26, 2017}}</ref> As Presto gained popularity, many well known companies, such as [[Netflix]],<ref>{{Cite news\|url=http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html\|title=Using Presto in our Big Data Platform on AWS\|authors=Eva Tse, Zhenxiao Luo, Nezih Yigitbasi\|date=October 7, 2014\|work=Netflix technical blog\|access-date=April 26, 2017}}</ref> [[AirBnB]],<ref>{{cite web \|title=Airpal: a Web UI for PrestoDB \|url=https://medium.com/airbnb-engineering/airpal-a-web-based-query-execution-tool-for-data-analysis-33c43265ed1f \|website=Medium \|access-date=20 September 2021 \|language=en \|date=4 April 2016}}</ref> among others, disclosed they used 
Presto in both on-premises and cloud deployments at equivalent petabyte scales. In late 2016, Amazon announced that it would provide Presto as a service called Athena.<ref>{{cite web \|title=AWS Launches Amazon Athena {{!}} Amazon.com, Inc. - Press Room \|url=https://press.aboutamazon.com/news-releases/news-release-details/aws-launches-amazon-athena \|website=press.aboutamazon.com \|access-date=20 September 2021 \|language=en}}</ref> In ~~late~~December ~~2018~~2020, ~~a~~PrestoSQL ~~disagreement~~was ~~around~~rebranded ~~the~~as ~~stewardship~~Trino. ~~of~~The ~~Presto~~Trino ~~between~~Software ~~the~~Foundation, ~~founders~~code base, and ~~Facebook~~all ~~formed~~other ~~as~~PrestoSQL ~~Facebook~~assets ~~management~~were ~~pushed~~renamed ~~to~~as ~~have~~part ~~tighter control over~~of the ~~project~~rebrand.<ref name="2020rename">{{cite web \|last1=Traverso \|first1=Martin \|last2=Sundstrom \|first2=Dain \|last3=Phillips \|first3=David \|title=~~We’re~~We're rebranding PrestoSQL as Trino \|url=https://trino.io/blog/2020/12/27/announcing-trino.html \|website=trino.io \|access-date=7 September 2021 \|language=en \|date=27 December 2020}}</ref> This move included giving automatic committership rights to Facebook developers without prior experience with the project.<ref name="2020rename"/> Shortly after Facebook management moved forward with these changes, the creators left the original Presto project to create a fork.<ref name="2020rename"/> This fork was also initially named Presto, so to differentiate them, users called the original project PrestoDB and the fork PrestoSQL named after their respective web addresses, https://prestodb.io and [https://trino.io https://prestosql.io].<ref name="380issue">{{cite web \|title=What is the relationship of prestosql and prestodb? 
· Issue #380 · trinodb/trino \|url=https://github.com/trinodb/trino/issues/380 \|website=GitHub \|access-date=24 September 2021 \|language=en}}</ref> It is worth noting that this split has striking similarities to the [[Jenkins (software)#History\|Jenkins and Hudson split]]. ▲Presto and Trino ~~was~~were originally designed and developed by Martin ~~Traverso~~, Dain ~~Sundstrom~~, David ~~Phillips~~, and Eric Hwang at [[Facebook]] to allow data analysts to run interactive queries on its large [[data warehouse]] in [[Apache Hadoop]]. ~~The project was originally named [[Presto (SQL query engine)\|Presto]] and~~Trino shares the first six years of development with the Presto project.<ref>{{cite web \|title=Contributors to trinodb/trino \|url=https://github.com/trinodb/trino/graphs/contributors?from=2012-08-05&to=2018-08-05&type=c \|website=GitHub \|access-date=20 September 2021 \|language=en}}</ref><ref>{{cite web \|title=Contributors to prestodb/presto \|url=https://github.com/prestodb/presto/graphs/contributors?from=2012-08-05&to=2018-08-05&type=c \|website=GitHub \|access-date=20 September 2021 \|language=en}}</ref> ~~Before~~To ~~Presto,~~learn ~~data~~more ~~analysts~~about atthe ~~Facebook~~earlier ~~relied~~history onof ~~[[Apache Hive]]~~Trino, ~~which~~you ~~was~~can ~~too~~reference ~~slow for running interctive~~[[Presto (SQL analytics on their 250 petabyte data warehouse.<ref name="2013facebook">{{Cite news\|url=http://www.computerworld.com/article/2485668/business-intelligence/facebook-goes-open-source-with-query-engine-for-big-data.html\|title=Facebook goes open source with query engine ~~for big data~~)#History\|~~author=Joab~~the ~~Jackson\|date=November~~Presto ~~6, 2013\|work=Computer World\|access-date=April 26, 2017}}</ref>~~history section]]. ▲In January 2019, the Trino Software Foundation (formerly Presto Software Foundation) was announced. 
The foundation is a not-for-profit organization dedicated to the advancement of the Trino open source distributed SQL query engine.<ref name="2019psf">{{Cite web\|url=https://www.prweb.com/releases/presto_software_foundation_launches_to_advance_presto_open_source_community/prweb16070792.htm\|title=Presto Software Foundation Launches to Advance Presto Open Source Community\|website=PRWeb\|access-date=2019-02-01}}</ref><ref name="2019psf2">{{Cite web\|url=https://thenewstack.io/prestos-new-foundation-signals-growth-for-the-big-data-sql-engine/\|title=Presto's New Foundation Signals Growth for the Big Data SQL Engine\|date=2019-01-31\|website=The New Stack\|language=en-US\|access-date=2019-02-01}}</ref> Trino is used in many data platforms and products from cloud providers and other vendors. Customization of these products varies from pure Trino usage to heavily customized systems to run a data platform or integration in specialized data platforms for usage with specific data. [https://trino.io/users Examples include Amazon Athena, Starburst Galaxy, Dune, and many others.] 
In September 2019, Facebook donated PrestoDB to the [[Linux Foundation]] establishing the Presto Foundation.<ref>{{Cite web\|url=https://www.linuxfoundation.org/press-release/2019/09/facebook-uber-twitter-and-alibaba-form-presto-foundation-to-tackle-distributed-data-processing-at-scale/\|title=Facebook, Uber, Twitter and Alibaba form Presto Foundation to Tackle Distributed Data Processing at Scale\|access-date=2019-11-12}}</ref> Neither the creators of Presto, nor the top contributors and committers, were invited to join this foundation.<ref name="2019comment">{{Cite news\|url=https://github.com/trinodb/trino/issues/380#issuecomment-557691046\|title=What's the relationship between prestosql and prestodb?\|date=2019-11-22}}</ref><ref name="2020rename"/> In December 2020, PrestoSQL was rebranded as Trino.<ref name="2020rename"/> The name comes from a shortening of the physics particle [[neutrino]], for its fast and light properties. The name Trino is shorter, sounds better, and is easier to search for on the web. <ref>{{cite web \|title=8: Trino: A ludicrously fast query engine: past, present, and future \|url=https://trino.io/episodes/8.html \|website=trino.io \|access-date=24 September 2021 \|language=en \|date=11 January 2021}}</ref> == Architecture == [[File:Figure 4-1 Trino architecture.png\|thumb\|Trino architecture overview with coordinator and workers<ref name="trino-definitive-guide-ch4">{{cite book \|last1=Fuller \|first1=Matt \|last2=Moser \|first2=Manfred \|last3=Traverso \|first3=Martin \|title=Trino: The Definitive Guide \|chapter=Chapter 4. Trino Architecture \|date=2021 \|publisher=O'Reilly Media, Inc, USA \|isbn=9781098107710 \|pages=43–72}}</ref>]] Trino is written in [[Java (programming language)\|Java]].<ref name="trino-definitive-guide-ch2">{{cite book \|last1=Fuller \|first1=Matt \|last2=Moser \|first2=Manfred \|last3=Traverso \|first3=Martin \|title=Trino: The Definitive Guide \|chapter=Chapter 2. 
Installing and Configuring Trino \|date=2021 \|publisher=O'Reilly Media, Inc, USA \|isbn=9781098107710 \|pages=19–24}}</ref> It runs on a cluster of servers that contains two types of nodes, a '''coordinator''' and a '''worker'''.<ref name="trino-definitive-guide-ch4" /> * The coordinator is responsible for parsing, analyzing, optimizing, planning, and scheduling a query submitted by a client. The coordinator interacts with the [[service provider interface]] (SPI) to obtain the available tables, ~~obtain~~ table statistics~~, check permissions~~, and other information needed to carry out its tasks.<ref name="trino-definitive-guide-ch4" /> * The workers are responsible for executing the tasks and operators fed to ~~it~~them by the scheduler. These tasks process rows from the data sources ~~and~~which produce results that are returned to the coordinator and ultimately back to the client.<ref name="trino-definitive-guide-ch4" />▼ Trino adheres to the [[ANSI]] [[SQL]]<ref name="trino-definitive-guide-ch1">{{cite book \|last1=Fuller \|first1=Matt \|last2=Moser \|first2=Manfred \|last3=Traverso \|first3=Martin \|title=Trino: The Definitive Guide \|chapter=Chapter 1. Introducing Trino \|date=2021 \|publisher=O'Reilly Media, Inc, USA \|isbn=9781098107710 \|pages=3–17}}</ref> standard and includes various parts of the following ANSI specifications: [[SQL-92]], [[SQL:1999]], [[SQL:2003]], [[SQL:2008]], [[SQL:2011]], [[SQL:2016]], [[SQL:2023]]. ▲* The workers are responsible for executing the tasks and operators fed to it by the scheduler. These tasks process rows from data sources and produce results that are returned to the coordinator and ultimately back to the client. 
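The division of labor between the coordinator and the workers described above can be illustrated with a small Python sketch. This is not Trino code and does not reflect its internal APIs: the function names `plan_splits`, `run_task`, and `coordinate` are hypothetical, a thread pool stands in for worker nodes, and a per-key sum stands in for a real query plan.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, num_workers):
    """Coordinator side (hypothetical): hash-partition rows by key.

    Rows with the same key always land in the same split, so each
    worker can aggregate its keys without talking to other workers.
    """
    splits = [[] for _ in range(num_workers)]
    for key, value in rows:
        index = zlib.crc32(key.encode("utf-8")) % num_workers
        splits[index].append((key, value))
    return splits


def run_task(split):
    """Worker side (hypothetical): sum the values of each key in one split."""
    partial = {}
    for key, value in split:
        partial[key] = partial.get(key, 0) + value
    return partial


def coordinate(rows, num_workers=3):
    """Coordinator side (hypothetical): schedule one task per split, merge results."""
    splits = plan_splits(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(run_task, splits))
    result = {}
    for partial in partials:
        # Keys are disjoint across splits because of the hash partitioning.
        result.update(partial)
    return result
```

Calling `coordinate([("a", 1), ("b", 2), ("a", 3)])` returns the per-key sums regardless of how many workers are used, because the partitioning step guarantees that no key is split across tasks.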
Trino supports the separation of compute and storage<ref name="trino-definitive-guide-ch1" /> and may be deployed both on-premises and in the [[Cloud computing\|cloud]].<ref name="trino-definitive-guide-ch13">{{cite book \|last1=Fuller \|first1=Matt \|last2=Moser \|first2=Manfred \|last3=Traverso \|first3=Martin \|title=Trino: The Definitive Guide \|chapter=Chapter 13. Real-World Examples \|date=2021 \|publisher=O'Reilly Media, Inc, USA \|isbn=9781098107710 \|pages=267–272}}</ref> Trino attempts to follow the [[ANSI]] [[SQL]] standard as closely as possible to include: [[SQL-92]], [[SQL:1999]], [[SQL:2003]], [[SQL:2008]], [[SQL:2011]], [[SQL:2016]]. Trino favors implementing SQL features more relevant to [[OLAP]] over [[OLTP]]. Trino has a [[Distributed computing~~\|distributed~~]] [[massively parallel\|MPP]] architecture,.<ref ~~which~~name="trino-definitive-guide-ch4" ~~was a big departure from the map reduce design used by most popular data lake systems like Hive, Impala, and [[Apache Spark]].~~/> Trino first distributes work over multiple workers by running ad-hoc partitioning operations or relying on existing partitions in the data of the underlying data store. Once this data has reached the worker, the data is processed over pipelined operators carried out on multiple threads.<ref ~~Another~~name="trino-definitive-guide-ch4" decided characteristic of Trino was avoiding the [[Application checkpointing\|checkpointing]] operations involving expensive writes, used by systems like Hive and Spark. Avoiding these writes may require restarting a query in the rare case of failure during the operation./>▼ ~~Trino supports separation of compute and storage and may be deployed both on premises and in the [[Cloud computing\|cloud]].~~ ▲Trino has a [[Distributed computing\|distributed]] [[massively parallel\|MPP]] architecture, which was a big departure from the map reduce design used by most popular data lake systems like Hive, Impala, and [[Apache Spark]]. 
Trino first distributes work over multiple workers by running ad-hoc partitioning operations or relying on existing partitions in the data of the underlying data store. Once this data has reached the worker, the data is processed over pipelined operators carried out on multiple threads. Another decided characteristic of Trino was avoiding the [[Application checkpointing\|checkpointing]] operations involving expensive writes, used by systems like Hive and Spark. Avoiding these writes may require restarting a query in the rare case of failure during the operation. ~~== Use Cases ==~~ In general, Trino is used for [[OLAP]] scenarios instead of [[OLTP]] uses.<ref>{{cite web \|title=Use cases — Trino 361 Documentation \|url=https://trino.io/docs/361/overview/use-cases.html \|website=trino.io \|access-date=20 September 2021}}</ref> ~~=== Data Lake Query Engine ===~~ Trino was originally created to replace the [[Apache Hive]] runtime while maintaining the ability to query data in [[Apache Hadoop#Hadoop distributed file system\|HDFS]] or [[object storage]]. Many companies use Trino as a query engine to speed up analytics reads from the data lake. ~~=== Federated Query Engine ===~~ Trino can combine data from multiple sources in a single query. Using the [[service provider interface\|SPI]], Trino connectors can query data sources, including files in [[Apache Hadoop#HDFS\|HDFS]], [[Amazon S3]], [[MySQL]], [[PostgreSQL]], [[Microsoft SQL Server]], [[Amazon Redshift]], [[Apache Kudu]], [[Apache Pinot]], [[Apache Kafka]], [[Apache Cassandra]], [[Apache Druid]], [[MongoDB]], [[Elasticsearch]], and [[Redis]]. Unlike [[Apache Impala]] and other prior Hadoop-specific tools, Trino can work with any underlying system. 
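The idea of combining several data sources in one query can be sketched with an analogy in Python. The example below does not use Trino at all: it attaches two independent in-memory SQLite databases to one connection and joins across them, which loosely mirrors how a Trino query can reference tables from different catalogs by qualified name. The database names `crm` and `sales`, the table layouts, and the function name `federated_join` are assumptions made purely for illustration.

```python
import sqlite3


def federated_join():
    """Join tables that live in two separate databases in a single query."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database (hypothetical sources).
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS sales")

    conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Lin")],
    )
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )

    # One statement reads from both databases via qualified table names.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM crm.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows
```

In a real deployment the two sides of the join would be served by different connectors, for example a relational database and files in object storage, rather than by two SQLite databases in one process.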
==See also==
* [[Presto (SQL query engine)]]
* [[Big data]]
* [[Data Intensive Computing]]
* [[~~Presto~~Apache ~~(SQL query engine)~~Drill]]
* [[Computer cluster]]

== References ==
{{Reflist}}~~<br/>~~

== External links ==
* [https://trino.io/foundation.html Trino Software Foundation (formerly Presto Software Foundation)]
* [https://github.com/prestodb/foundation Presto Foundation] (under the [[Linux Foundation]])

[[:Category:SQL]] [[:Category:Free system software]] [[:Category:Hadoop]] [[:Category:Cloud platforms]] [[:Category:Java platform]]