Druid is a column-oriented, open-source, distributed data store written in Java. It is designed to ingest massive quantities of time-series data quickly and to make that data available to queries immediately,[1] a capability often described as real-time.
Druid | |
---|---|
Original author(s) | Eric Tschetter, Fangjin Yang |
Developer(s) | The Druid Community |
Stable release | 0.6.121 / 18 June 2014 |
Repository | |
Written in | Java |
Operating system | Cross-platform |
Type | distributed, real-time, column-oriented data store |
License | GNU General Public License v2 |
Website | druid |
On the developer Q&A site Stack Overflow, Druid is described as "open-source infrastructure for real-time exploratory analytics on large datasets."[2] It ingests time-series data, chunking and compressing that data into column-based, queryable segments.[3]
Architecture[4]
Fully deployed, Druid runs as a cluster of specialized nodes. The architecture is fault-tolerant: data is stored redundantly and each node type has multiple members.[5] The cluster also relies on external dependencies for coordination (Apache ZooKeeper), metadata storage (MySQL), and deep storage (e.g., HDFS, Amazon S3, or Apache Cassandra).
Data Ingestion
Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming datasource. Batch-loaded data formats can be JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of data in the cluster.
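The three input formats named above carry the same kind of record. As a minimal illustration in Python, assuming hypothetical field names (`timestamp`, `page`, `clicks`) and headerless delimited input, the sketch below parses a JSON event, a CSV line, and a TSV line into the same row. It uses only the Python standard library and is not Druid's own parser or ingestion API.

```python
import csv
import io
import json

def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dict rows."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def parse_delimited(text, columns, delimiter=","):
    """Parse headerless CSV/TSV text into dict rows using the given column names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(columns, row)) for row in reader if row]

# Hypothetical schema, chosen only for this illustration.
columns = ["timestamp", "page", "clicks"]

json_rows = parse_json_lines(
    '{"timestamp": "2014-06-18T10:05:00Z", "page": "home", "clicks": "3"}'
)
csv_rows = parse_delimited("2014-06-18T10:05:00Z,home,3", columns)
tsv_rows = parse_delimited("2014-06-18T10:05:00Z\thome\t3", columns, delimiter="\t")
```

Delimited formats carry no field names of their own, which is why the column list has to be supplied separately in this sketch.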
Real-time nodes chunk data into segments and are designed to move these segments out to deep storage frequently. To keep the cluster aware of the location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
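The idea of chunking time-series events into column-based segments can be sketched as follows. This is an illustration under stated assumptions, not Druid's segment format: the hourly granularity, the event fields, and the function names `segment_key` and `build_segments` are all hypothetical, and compression is omitted.

```python
from collections import defaultdict
from datetime import datetime, timezone

def segment_key(timestamp):
    """Truncate an ISO-8601 UTC timestamp to the start of its hour."""
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ts.strftime("%Y-%m-%dT%H:00Z")

def build_segments(events):
    """Group row-oriented events by hour and pivot each group into columns."""
    segments = defaultdict(lambda: defaultdict(list))
    for event in events:
        columns = segments[segment_key(event["timestamp"])]
        # Each field value is appended to that field's column within the segment.
        for name, value in event.items():
            columns[name].append(value)
    return {key: dict(cols) for key, cols in segments.items()}

# Hypothetical events; every event is assumed to carry the same fields.
events = [
    {"timestamp": "2014-06-18T10:05:00Z", "page": "home", "clicks": 3},
    {"timestamp": "2014-06-18T10:40:00Z", "page": "about", "clicks": 1},
    {"timestamp": "2014-06-18T11:15:00Z", "page": "home", "clicks": 7},
]
segments = build_segments(events)
```

Storing each field as its own list means a query that aggregates only `clicks` can read that one column without touching the others.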
Query Management
Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
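The broker's role described above amounts to a scatter-gather pattern. The sketch below illustrates it with an in-memory stand-in for the cluster; the node names, segment identifiers, the `NODES` table, and the function names are hypothetical, and a real broker would discover segment locations through the cluster rather than from a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical cluster state: node name -> {segment id -> rows held by that node}.
NODES = {
    "historical-1": {"2014-06-18T10": [("home", 3), ("about", 1)]},
    "historical-2": {"2014-06-18T11": [("home", 7)]},
    "realtime-1": {"2014-06-18T12": [("about", 2)]},
}

def nodes_for(segment_ids):
    """Return the nodes holding at least one of the requested segments."""
    return [name for name, segs in NODES.items() if segs.keys() & segment_ids]

def partial_result(node, segment_ids):
    """What a data node would compute locally: clicks per page for its segments."""
    totals = Counter()
    for seg_id in NODES[node].keys() & segment_ids:
        for page, clicks in NODES[node][seg_id]:
            totals[page] += clicks
    return totals

def broker_query(segment_ids):
    """Scatter the query to the relevant nodes, then merge the partial results."""
    merged = Counter()
    for node in nodes_for(segment_ids):
        merged.update(partial_result(node, segment_ids))
    return dict(merged)

result = broker_query({"2014-06-18T10", "2014-06-18T11"})
```

Summation merges cleanly because partial sums can be added in any order; other aggregations would need their own merge step.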
Cluster Management
Data management operations on historical nodes are overseen by coordinator nodes, which are the primary users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of inter-node communication, and provide leader election.
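One data management task a coordinator could perform is deciding which historical nodes hold which segments while keeping the data redundant. The sketch below uses a generic least-loaded placement with a replication factor of two. Both the strategy and the names (`assign_segments`, `replicas`, the node and segment labels) are assumptions made for illustration; the article does not describe the coordinator's actual balancing algorithm.

```python
def assign_segments(segment_ids, nodes, replicas=2):
    """Place each segment on the `replicas` least-loaded distinct nodes."""
    load = {node: [] for node in nodes}
    for seg_id in segment_ids:
        # Sort by current load, breaking ties by name so the result is deterministic.
        targets = sorted(load, key=lambda n: (len(load[n]), n))[:replicas]
        for node in targets:
            load[node].append(seg_id)
    return load

# Hypothetical segments and historical nodes.
placement = assign_segments(["seg-a", "seg-b", "seg-c"], ["h1", "h2", "h3"])
```

With every segment on two distinct nodes, the loss of any single node in this example still leaves one copy of each segment available.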
History
Druid was created by the real-time analytics company Metamarkets as a major part of its backend. The company open-sourced Druid in late 2012.[6] Since then, a number of organizations and companies, including Netflix[7] and Yahoo,[8] have integrated Druid into their backend technology.
References
- Hemsoth, Nicole. "Druid Summons Strength in Real-Time", datanami, 8 November 2012
- Stack Overflow shorthand tag description
- Monash, Curt. "Metamarkets Druid Overview", DBMS2, 16 June 2012
- Druid Project Documentation
- Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store", Metamarkets, retrieved 6 February 2014
- Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database", GigaOM, 24 October 2012
- Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline", Netflix, 9 December 2013
- Iranmanesh, Reza; Chandrashekar, Srikalyan. "Pushing the limits of Realtime Analytics using Druid", Slideshare, 19 July 2014