Trino (SQL query engine): Difference between revisions

}}
 
'''Trino''' is an [[Open-source software|open-source]] distributed [[SQL]] query engine designed to query large data sets distributed over one or more heterogeneous data sources.<ref>{{cite web |title=Overview — Trino 361 Documentation |url=https://trino.io/docs/361/overview.html |website=trino.io |access-date=20 September 2021}}</ref> Trino is commonly used as a query engine over [[Data lake|data lakes]] and [[Data Warehouse|data warehouses]] using the [[Apache Hive|Hive]] and [[List of Apache Software Foundation projects#Active projects|Iceberg]]<ref name="iceberg">{{cite web |title=About - Apache Iceberg |url=http://iceberg.apache.org/ |website=iceberg.apache.org |access-date=18 September 2021}}</ref> table formats. In these configurations, Trino can query data in [[Free and open source software|open]] [[Column-oriented DBMS|column-oriented]] file formats such as [[Apache ORC|ORC]] or [[Apache Parquet|Parquet]] residing on storage systems such as [[Apache Hadoop#Hadoop distributed file system|HDFS]], [[Amazon S3|AWS S3]], [[Google Cloud Storage]], or [[Microsoft Azure#Storage services|Azure Blob Storage]]. Trino can also run federated queries across disparate data sources such as [[MySQL]], [[PostgreSQL]], [[Apache Cassandra|Cassandra]], [[Apache Kafka|Kafka]], [[MongoDB]] and [[Elasticsearch]]. Trino is community-driven and released under the [[Apache License]].
 
== History ==
Trino was originally designed and developed by Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang at [[Facebook]] to allow data analysts to run interactive queries on its large [[data warehouse]] in [[Apache Hadoop]]. The project was originally named [[Presto (SQL query engine)|Presto]] and shares its first six years of development with the Presto project.<ref>{{cite web |title=Contributors to trinodb/trino |url=https://github.com/trinodb/trino/graphs/contributors?from=2012-08-05&to=2018-08-05&type=c |website=GitHub |access-date=20 September 2021 |language=en}}</ref><ref>{{cite web |title=Contributors to prestodb/presto |url=https://github.com/prestodb/presto/graphs/contributors?from=2012-08-05&to=2018-08-05&type=c |website=GitHub |access-date=20 September 2021 |language=en}}</ref> Before Presto, data analysts at Facebook relied on [[Apache Hive]], which was too slow for running interactive SQL analytics on their 250-petabyte data warehouse.<ref name="2013facebook">{{Cite news|url=http://www.computerworld.com/article/2485668/business-intelligence/facebook-goes-open-source-with-query-engine-for-big-data.html|title=Facebook goes open source with query engine for big data|author=Joab Jackson|date=November 6, 2013|work=Computer World|access-date=April 26, 2017}}</ref>
 
Traverso, Sundstrom, Phillips, and Hwang began development in 2012 and deployed an initial version later that year. Facebook announced the project's release as open source in the fall of 2013.<ref name="2013facebook" /><ref name="2013facebook2">{{Cite news|url=https://gigaom.com/2013/06/06/facebook-unveils-presto-engine-for-querying-250-pb-data-warehouse/|title=Facebook unveils Presto engine for querying 250 PB data warehouse|author=Jordan Novet|date=June 6, 2013|work=Giga Om|access-date=April 26, 2017}}</ref> As Presto gained popularity, well-known companies such as [[Netflix]]<ref>{{Cite news|url=http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html|title=Using Presto in our Big Data Platform on AWS|authors=Eva Tse, Zhenxiao Luo, Nezih Yigitbasi|date=October 7, 2014|work=Netflix technical blog|access-date=April 26, 2017}}</ref> and [[Airbnb]]<ref>{{cite web |title=Airpal: a Web UI for PrestoDB |url=https://medium.com/airbnb-engineering/airpal-a-web-based-query-execution-tool-for-data-analysis-33c43265ed1f |website=Medium |access-date=20 September 2021 |language=en |date=4 April 2016}}</ref> disclosed that they used Presto in both on-premises and cloud deployments at comparable petabyte scales. In late 2016, Amazon announced that it would offer Presto as a service called Athena.<ref>{{cite web |title=AWS Launches Amazon Athena {{!}} Amazon.com, Inc. - Press Room |url=https://press.aboutamazon.com/news-releases/news-release-details/aws-launches-amazon-athena |website=press.aboutamazon.com |access-date=20 September 2021 |language=en}}</ref>
 
In late 2018, a disagreement over the stewardship of Presto arose between the founders and Facebook as Facebook management pushed for tighter control over the project. This move included giving automatic committer rights to Facebook developers without prior experience with the project. Shortly after Facebook management moved forward with these changes, the creators left the original Presto project to create a fork.<ref name="2020rename">{{cite web |last1=Traverso |first1=Martin |last2=Sundstrom |first2=Dain |last3=Phillips |first3=David |title=We’re rebranding PrestoSQL as Trino |url=https://trino.io/blog/2020/12/27/announcing-trino.html |website=trino.io |access-date=7 September 2021 |language=en |date=27 December 2020}}</ref> The fork was also initially named Presto, so to differentiate the two, users called the original project PrestoDB and the fork PrestoSQL, after their respective web addresses, https://prestodb.io and [https://trino.io https://prestosql.io]. The split resembles the [[Jenkins (software)#History|Jenkins and Hudson split]].
In September 2019, Facebook donated PrestoDB to the [[Linux Foundation]], establishing the Presto Foundation.<ref>{{Cite web|url=https://www.linuxfoundation.org/press-release/2019/09/facebook-uber-twitter-and-alibaba-form-presto-foundation-to-tackle-distributed-data-processing-at-scale/|title=Facebook, Uber, Twitter and Alibaba form Presto Foundation to Tackle Distributed Data Processing at Scale|access-date=2019-11-12}}</ref> Neither the creators of Presto nor the top contributors and committers were invited to join this foundation.<ref>{{Cite news|url=https://github.com/trinodb/trino/issues/380#issuecomment-557691046|title=What's the relationship between prestosql and prestodb?|date=2019-11-22}}</ref><ref name="2020rename"/>
 
In December 2020, PrestoSQL was rebranded as Trino.<ref name="2020rename"/>
 
== Architecture ==
Trino supports separation of compute and storage and may be deployed both on premises and in the [[Cloud computing|cloud]].
 
Trino has a [[Distributed computing|distributed]] [[massively parallel|MPP]] architecture, which was a significant departure from the [[MapReduce]]-style design used by data lake systems such as Hive and [[Apache Spark]]. Trino first distributes work over multiple workers, either by running ad-hoc partitioning operations or by relying on existing partitions in the underlying data store. Once the data has reached a worker, it is processed by pipelined operators running on multiple threads. Another deliberate design choice was to avoid the [[Application checkpointing|checkpointing]] operations, which involve expensive writes, used by systems like Hive and Spark. As a result, a query may need to be restarted if a failure occurs while it is running. In practice, such failures are reported to be infrequent.
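
The partition-then-pipeline flow described above can be illustrated with a minimal Python sketch. It is not Trino's implementation: the row shape, the toy hash function, and the names <code>hash_partition</code>, <code>worker_pipeline</code>, and <code>run_query</code> are all assumptions made for this example, and the workers run one after another here instead of on separate machines and threads.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical row shape for illustration: (key, value).
Row = Tuple[str, int]


def hash_partition(rows: Iterable[Row], num_workers: int) -> Dict[int, List[Row]]:
    """Ad-hoc partitioning: route each row to a worker based on its key."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for key, value in rows:
        # Toy deterministic hash (sum of the key's bytes), chosen only so the
        # example is reproducible; a real engine would use a stronger hash.
        worker = sum(key.encode()) % num_workers
        partitions[worker].append((key, value))
    return partitions


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: emit rows one at a time."""
    yield from rows


def filter_op(rows: Iterator[Row], min_value: int) -> Iterator[Row]:
    """Filter operator: pass through rows whose value is at least min_value."""
    for key, value in rows:
        if value >= min_value:
            yield key, value


def worker_pipeline(rows: List[Row], min_value: int) -> Dict[str, int]:
    """Chain the operators so each row streams through scan, filter, and
    aggregation without materializing intermediate results."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in filter_op(scan(rows), min_value):
        totals[key] += value
    return dict(totals)


def run_query(rows: Iterable[Row], num_workers: int, min_value: int) -> Dict[str, int]:
    """Partition the input, run the pipeline per worker, and merge the results."""
    merged: Dict[str, int] = defaultdict(int)
    for worker_rows in hash_partition(rows, num_workers).values():
        for key, total in worker_pipeline(worker_rows, min_value).items():
            merged[key] += total
    return dict(merged)


sample = [("a", 1), ("b", 5), ("a", 7), ("c", 10), ("b", 2)]
result = run_query(sample, num_workers=2, min_value=2)
```

Because rows are partitioned by key, every row for a given key reaches the same worker, so each worker can aggregate its keys independently. The sketch keeps no checkpoints: if one worker's pipeline raised an error, the whole <code>run_query</code> call would have to be repeated, which mirrors the trade-off described above.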
 
== Use Cases ==
=== Data Lake Query Engine ===
 
Trino was originally created to replace the [[Apache Hive]] runtime while maintaining the ability to query data in [[Apache Hadoop#Hadoop distributed file system|HDFS]] or [[object storage]]. Many companies use Trino as a query engine to speed up analytical reads from a data lake.
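
As a rough illustration of this use case, the Python sketch below composes a Trino SQL statement against a data lake table. Trino addresses tables with a three-part <code>catalog.schema.table</code> name; the catalog <code>hive</code>, schema <code>web</code>, table <code>page_views</code>, column <code>view_date</code>, and the helper functions are hypothetical names chosen for this example. The sketch only builds the query text and does not connect to a cluster.

```python
import re

# Conservative pattern for unquoted identifiers; an assumption of this sketch,
# not Trino's full identifier grammar.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return the three-part name catalog.schema.table used to address a table."""
    for part in (catalog, schema, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


def daily_counts_query(catalog: str, schema: str, table: str) -> str:
    """Build an aggregate query over a (hypothetical) view_date column."""
    name = qualified_name(catalog, schema, table)
    return (
        "SELECT view_date, count(*) AS views\n"
        f"FROM {name}\n"
        "GROUP BY view_date\n"
        "ORDER BY view_date"
    )


# Hypothetical Hive catalog backed by files in HDFS or object storage.
query = daily_counts_query("hive", "web", "page_views")
```

The resulting text could then be submitted through any Trino client; the files behind the table stay in HDFS or object storage, and only the query engine changes.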
 
=== Federated Query Engine ===