Apache Druid

This is an old revision of this page, as edited by Textractor (talk | contribs) at 18:11, 1 July 2014 (ArchitectureDruid Project Documentation: Added new architecture diagram). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Druid is a column-oriented open-source distributed data store written in Java. Druid is designed to quickly ingest massive quantities of time-series data, making that data immediately available to queries.[1] This is sometimes referred to as real-time data.

Druid
Original author(s): Eric Tschetter, Fangjin Yang
Developer(s): The Druid Community
Stable release: 0.6.121 / 18 June 2014
Written in: Java
Operating system: Cross-platform
Type: Distributed, real-time, column-oriented data store
License: GNU General Public License v2
Website: druid.io

On the developer Q&A site Stack Overflow, Druid is described as "open-source infrastructure for real-time exploratory analytics on large datasets."[2] It is designed to ingest time-series data, chunking and compressing that data into column-based queryable segments.[3]

Architecture[4]

Druid Cluster

Fully deployed, Druid runs as a cluster of specialized nodes, giving it a fault-tolerant architecture in which data is stored redundantly and there are multiple members of each node type.[5] The cluster also relies on external dependencies for coordination (Apache ZooKeeper), storage of metadata (MySQL), and a deep storage facility (e.g., HDFS, Amazon S3, or Apache Cassandra).

Data Ingestion

Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming datasource. Batch-loaded data formats can be JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of data in the cluster.
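For illustration, the following sketch shows the kind of JSON-formatted event a real-time node might accept from a streaming datasource. The field names here are hypothetical, not a fixed Druid schema; in practice the schema is defined per datasource.

```python
import json

# Hypothetical ingestion event: a timestamp, a few dimensions
# (page, country), and a metric (added). Field names are illustrative.
event = {
    "timestamp": "2014-07-01T18:11:00Z",  # time-series data is keyed on time
    "page": "Apache Druid",               # dimension
    "country": "US",                      # dimension
    "added": 57,                          # metric to be aggregated
}

line = json.dumps(event)  # one event per line on the wire
print(line)
```

Batch-loaded data carries the same kind of timestamped rows, but arrives as JSON, CSV, or TSV files from deep storage rather than as a stream.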

Real-time nodes chunk data into segments, and are designed to frequently move these segments out to deep storage. To keep the cluster aware of the ___location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
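The chunking step can be sketched as grouping events by time interval. This is a toy example: the hourly granularity and the segment-naming convention used here are assumptions for illustration, not Druid's exact identifiers.

```python
from collections import defaultdict

def chunk_into_segments(events, datasource="wikipedia"):
    """Group timestamped events into hourly chunks, one would-be segment per hour."""
    segments = defaultdict(list)
    for event in events:
        hour = event["timestamp"][:13]  # e.g. "2014-07-01T18"
        segments[f"{datasource}_{hour}"].append(event)
    return dict(segments)

events = [
    {"timestamp": "2014-07-01T18:05:00Z", "page": "Apache Druid"},
    {"timestamp": "2014-07-01T18:40:00Z", "page": "Main Page"},
    {"timestamp": "2014-07-01T19:02:00Z", "page": "Apache Druid"},
]
segments = chunk_into_segments(events)
print(sorted(segments))
# ['wikipedia_2014-07-01T18', 'wikipedia_2014-07-01T19']
```

Each chunk is what would then be compressed into a column-based queryable segment and, eventually, handed off to deep storage.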

Query Management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
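The broker's merge step follows a scatter-gather pattern. The sketch below is a minimal illustration of that pattern, not Druid's actual implementation: partial aggregates arrive from the nodes holding the relevant segments, and the broker sums them per key before returning a single result.

```python
from collections import defaultdict

def merge_partial_results(partials):
    """Merge per-node partial sums keyed by (time bucket, dimension value)."""
    merged = defaultdict(int)
    for partial in partials:  # one dict per historical or real-time node
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

# Two nodes each hold different shards covering the same interval.
node_a = {("2014-07-01", "US"): 10, ("2014-07-01", "DE"): 4}
node_b = {("2014-07-01", "US"): 7}
result = merge_partial_results([node_a, node_b])
print(result)
# {('2014-07-01', 'US'): 17, ('2014-07-01', 'DE'): 4}
```

Because segments may be partitioned into shards on different nodes, no single data node can answer the query alone; the broker is the only component that sees the complete picture.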

Cluster Management

Operations relating to data management in historical nodes are overseen by coordinator nodes, which are the prime users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communication, and provide for leader elections.
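One of the coordinator's data-management duties is deciding which historical nodes should load which segments. The following toy sketch (an assumption for illustration, not Druid's real balancing logic) assigns each segment to the least-loaded node.

```python
def assign_segments(segments, nodes):
    """Toy balancer: send each segment to the node with the least data so far."""
    load = {node: 0 for node in nodes}      # bytes currently assigned per node
    assignment = {}
    for segment_id, size in segments:
        target = min(load, key=load.get)    # pick the least-loaded node
        assignment[segment_id] = target
        load[target] += size
    return assignment

segments = [("seg1", 100), ("seg2", 50), ("seg3", 60)]
assignment = assign_segments(segments, ["historical-1", "historical-2"])
print(assignment)
# {'seg1': 'historical-1', 'seg2': 'historical-2', 'seg3': 'historical-2'}
```

In a real deployment such decisions are recorded in the metadata tables and communicated through ZooKeeper, so that historical nodes discover which segments to pull from deep storage.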

History

Druid was created by the real-time analytics company Metamarkets as a major part of its backend. The company open-sourced Druid in late 2012.[6] Since then, a number of organizations and companies, including Netflix,[7] have integrated Druid into their backend technology.

References

  1. ^ Hemsoth, Nicole. "Druid Summons Strength in Real-Time", Datanami, 8 November 2012
  2. ^ Stack Overflow shorthand tag description
  3. ^ Monash, Curt. "Metamarkets Druid Overview", DBMS2, 16 June 2012
  4. ^ Druid Project Documentation
  5. ^ Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store", Metamarkets, retrieved 6 February 2014
  6. ^ Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database", GigaOM, 24 October 2012
  7. ^ Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline", Netflix, 9 December 2013