Graph Query Language: Difference between revisions

Content deleted Content added
Heading changes, sub-headings
Georgiana's thesis, Hannes' FOSDEM slides, refs for G_CORE impls, GSQL closed schema, MG CIP, community WGs, pre-history.
Line 5:
 
The GQL project proposal states{{quote|left=3|right=7|text="Using graph as a fundamental representation for data modeling is an emerging approach in data management. In this approach, the data set is modeled as a graph, representing each data entity as a vertex (also called a node) of the graph and each relationship between two entities as an edge between corresponding vertices. The graph data model has been drawing attention for its unique advantages. Firstly, the graph model can be a natural fit for data sets that have hierarchical, complex, or even arbitrary structures. Such structures can be easily encoded into the graph model as edges. This can be more convenient than the relational model, which requires the normalization of the data set into a set of tables with fixed row types. Secondly, the graph model enables efficient execution of expensive queries or data analytic functions that need to observe multi-hop relationships among data entities, such as reachability queries, shortest or cheapest path queries, or centrality analysis. There are two graph models in current use: the Resource Description Framework (RDF) model and the Property Graph model. The RDF model has been standardized by W3C in a number of specifications. The Property Graph model, on the other hand, has a multitude of implementations in graph databases, graph algorithms, and graph processing facilities. However, a common, standardized query language for property graphs (like SQL for relational database systems) is missing. GQL is proposed to fill this void."<ref name="BSI 39075 GQL">{{cite web|url=https://standardsdevelopment.bsigroup.com/projects/9019-02970|title=ISO/IEC JTC 1/SC 32 N 3007 - ISO/IEC NP 39075 Information Technology -- Database Languages -- GQL|last=|first=|date=|website=|publisher=British Standards Institute|accessdate=September 29, 2019}}</ref>.}}
 
The GQL project is the culmination of converging initiatives dating back to 2016, particularly a private proposal from Neo4j to other database vendors in July 2016<ref name="Creating standard">{{cite web|url=https://s3.amazonaws.com/artifacts.opencypher.org/website/materials/DM32.2/DM32.2-2018-00144.Creating+an+Open+Industry+Standard+for+a+Declarative+Property+Graph+Query+Language.pdf|title=''Creating an Open Industry Standard for a Declarative Property Graph Query Language''|last1=Green|first1=Alastair|date=July 2016|website=|publisher=opencypher.org|accessdate=November 12, 2019}}</ref>, and a proposal from Oracle technical staff within the ISO/IEC JTC 1 standards process later that year<ref name="Towards NWIP">{{cite web|url=https://s3.amazonaws.com/artifacts.opencypher.org/website/materials/DM32.2/DM32.2-2018-00128r1.Working+towards+a+GQL+NWIP.pdf|title=''Working towards a New Work Item for GQL, to complement SQL PGQ''|last1=Green|first1=Alastair|date=July 2018|website=|publisher=opencypher.org|accessdate=November 12, 2019}}</ref>.
 
The GQL project is led by Stefan Plantikow (who was the first lead engineer of Neo4j's Cypher for Apache Spark project) and Stephen Cannan (Technical Corrigenda editor of SQL). They are also the editors of the initial early working drafts<ref name="GQL EWD v2.2">{{cite web|url=https://isotc.iso.org/livelink/livelink?func=ll&objId=20836584&objAction=Open|title=''GQL Early Working Draft v2.2''.|last=Eds. Plantikow|first=Stefan|last2=Cannan|first2=Stephen|date=November 2019|website=|publisher=ISO|accessdate=November 9, 2019}}</ref> of the GQL specification.
 
As originally motivated<ref name="Towards NWIP"/>, the GQL project aims to complement the work of creating an implementable normative natural-language specification with supportive community efforts that enable contributions from those who are unable or uninterested in taking part in the formal process of defining a JTC 1 International Standard<ref name="community">{{cite web|url=https://www.gqlstandards.org/|title=''GQL Standard''|accessdate=November 12, 2019}}</ref><ref name="GCU 3">{{cite web|url=https://www.gqlstandards.org/community-updates|title=''GQL Community Updates''|accessdate=November 12, 2019}}</ref>. In July 2019 the Linked Data Benchmark Council (LDBC) agreed to become the umbrella organization for the efforts of community technical working groups. The Existing Languages and the Property Graph Schema working groups formed in late 2018 and early 2019 respectively. A working group to define formal denotational semantics for GQL was proposed at the third GQL Community Update in October 2019<ref name="FSWG">{{cite web|url=https://drive.google.com/open?id=15DAUAORu477FF-DooTH2ol0SZhx2ARtr|title=''Formal Semantics Working Group''|last=Libkin|first=Leonid|accessdate=November 12, 2019}}</ref>.
 
====The GQL property graph data model====
 
GQL is a query language specifically for property graphs. A property graph closely resembles a conceptual data model, as expressed in an Entity Relationship model or in a UML class diagram (although it does not include n-ary relationships linking more than two entities). Entities or concepts are modelled as nodes, and relationships as edges, in a graph. Property graphs are ''multigraphs'': there can be many edges between the same pair of nodes. GQL graphs can be ''mixed'': they can contain directed edges, where one of the endpoint nodes of an edge is the tail (or source) and the other node is the head (or target or destination), but they can also contain undirected (bidirectional or reflexive) edges.
 
Nodes and edges, collectively known as elements, have attributes. Those attributes may be data values, or labels (tags). Values of properties cannot be elements of graphs, nor can they be whole graphs: these restrictions intentionally force a clean separation between the topology of a graph, and the attributes carrying data values in the context of a graph topology. The property graph data model therefore deliberately prevents nesting of graphs, or treating nodes in one graph as edges in another. Each property graph may have a set of labels and a set of properties that are associated with the graph as a whole.
Line 20 ⟶ 24:
Additional aspects of the ERM or UML models (like generalization or subtyping, or entity or relationship cardinalities) may be captured by GQL schemas or types that describe possible instances of the general data model.
 
==JTC 1/SC32 Working Group 3 (WG3): extendingExtending SQL and creating GQL ==
The GQL project has a four-year timespan. Seven national standards bodies (those of the United States, China, Korea, the Netherlands, the United Kingdom, Denmark and Sweden) have nominated national subject-matter experts to work on the project, which is conducted by Working Group 3 (Database Languages) of ISO/IEC JTC 1's Subcommittee 32 (Data Management and Interchange), usually abbreviated as '''ISO/IEC JTC 1/SC 32 WG3''', or just "'''WG3"''' for short. WG3 (and its direct predecessor committees within JTC 1) has been responsible for the SQL standard since 1987.<ref name="SC32 and WG3 history">{{cite web|url=https://jtc1info.org/sd_2-history_of_jtc1/jtc1-subcommittees/sc-32/|title=JTC 1/SC 32 Data Management and Interchange|last=|first=|date=|website=|publisher=ISO/IEC JTC1|accessdate=October 6, 2019}}</ref><ref name="1987 scope of SQL">{{cite web|url=https://isotc.iso.org/livelink/livelink?func=ll&objId=19733701&objAction=Open/|title=''Scope from the original standard, ISO 9075-1987, Database Language SQL''|last=|first=|date=|website=|publisher=ISO/IEC JTC1|accessdate=November 9, 2019}}</ref>
 
A working proposal for the scope and features of GQL<ref name="ARK028 slides">{{cite web|url=https://drive.google.com/open?id=14nYtuAmN1Ui1b8ppfLH-D0FYTOMi2RnV|title=''High-level Scope and Characteristics of GQL''|last=Neo4 Inc.|last2=TigerGraph Inc.|date=|website=|publisher=gqlstandards.org|accessdate=October 6, 2019}}</ref> was put forward by expert contributors from Sweden, the United Kingdom and the United States in the query languages standards teams at Neo4j Inc. and TigerGraph Inc. in September 2019, following the start of the project. This proposal, along with supplementary material detailing transaction demarcation semantics and syntax, and an approach to transaction isolation that would allow implementers to include isolation levels additional to those in the SQL standard, were agreed at the meeting of WG3 that took place in the same month in Arusha, Tanzania.
Line 32 ⟶ 36:
=== Cypher ===
Cypher<ref name="Cypher">{{cite web|url=https://dl.acm.org/citation.cfm?id=3190657|title=''Cypher: An Evolving Query Language for Property Graphs.'' In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). ACM, New York, NY, USA, 1433-1445. DOI: 10.1145/3183713.3190657|last=Francis|first=Nadime|display-authors=etal|date=|website=|publisher=ACM|accessdate=October 25, 2019}}</ref> is a language originally designed andby firstAndrés implementedTaylor byand colleagues at Neo4j Inc., and first implemented by that company in 2011. Since 2015 it has been made available as an open source language description<ref name="Cypher 9">{{cite web|url=https://s3.amazonaws.com/artifacts.opencypher.org/openCypher9.pdf|title=''Cypher Query Language Reference (Version 9)'' |last=|first=|display-authors=etal|date=|website=|publisher=opencypher.org|accessdate=November 10, 2019}}</ref> with grammar tooling, a JVM front-end that parses Cypher queries, and a Technology Compatibility Kit (TCK) of over 2000 test scenarios, using Cucumber for implementation language portability<ref name="Cypher Resources">{{cite web|url=http://www.opencypher.org/resources|title=''openCypher Resources''|last=|first=|display-authors=etal|date=|website=|publisher=ACM|accessdate=November 10, 2019}}</ref>. The TCK reflects the language description and an enhancement for temporal datatypes and functions documented in a Cypher Improvement Proposal<ref name="Date-Time CIP">{{cite web|url=https://github.com/thobe/openCypher/blob/date-time/cip/1.accepted/CIP2015-08-06-date-time.adoc|title=''CIP2015-08-06 - Date and Time''|last=|first=|date=|website=|publisher=opencypher.org|accessdate=October 25, 2019}}</ref>.
 
Cypher allows creation, reading, updating and deleting of graph elements, and is a language that can therefore be used for analytics engines and transactional databases.
Line 39 ⟶ 43:
Cypher uses compact fixed- and variable-length patterns which combine visual representations of node and relationship (edge) topologies, with label existence and property value predicates. (These patterns are usually referred to as "ASCII Art" patterns, and arose originally as a way of commenting programs which used a lower-level graph API.<ref name="GQLs history"/>) By matching such a pattern against graph data elements, a query can extract references to nodes, relationships and paths of interest. Those references are emitted as a "binding table" where column names are bound to a multiset of graph elements. The name of a column becomes the name of a "binding variable", whose value is a specific graph element reference for each row of the table.
 
For example, a pattern &nbsp;{{code|MATCH (p:Person})-[:LIVES_IN]->(c:City)}}&nbsp; will generate a two-column output table. The first column named &nbsp;{{code|p}}&nbsp; will contain references to nodes with a label &nbsp;{{code|Person}}&nbsp;. The second column named &nbsp;{{code|c}}&nbsp; will contain references to nodes with a label &nbsp;{{code|City}}&nbsp;, denoting the city where the person lives.
 
The binding variables &nbsp;{{code|p}}&nbsp; and &nbsp;{{code|c}}&nbsp; can then be dereferenced to obtain access to property values associated with the elements referred to by a variable. The example query might be terminated with a &nbsp;{{code|RETURN}}, resulting in a complete query like this:
 
<pre>MATCH (p:Person})-[:LIVES_IN]->(c:City)
RETURN p.first_name, p.last_name, c.name, c.state</pre>This would result in a final four-column table listing the names of the residents of the cities stored in the graph.
 
Pattern-based queries are able to express joins, by combining multiple patterns which use the same binding variable to express a natural join using the &nbsp;{{code|MATCH}}&nbsp; clause:
Queries are therefore able to first project a sub-graph of the graph input into the query, and then extract the data values associated with that subgraph. Data values can also be processed by functions, including aggregation functions, leading to the projection of computed values which render the information held in the projected graph in various ways. Patterns of this kind have become pervasive in property graph query languages, and are the basis for the advanced pattern sub-language being defined in SQL/PGQ, which is likely to become a subset of the GQL language.
 
Cypher uses patterns for insertion and modification clauses (CREATE and MERGE), and proposals have been made in the GQL project for collecting node and edge patterns to describe graph types.
<pre>MATCH (p:Person)-[:LIVES_IN]->(c:City), (p:Person)-[:NATIONAL_OF]->(EUCountry)
RETURN p.first_name, p.last_name, c.name, c.state</pre>
This query would return the residential ___location only of EU nationals.
 
An outer join can be expressed by &nbsp;{{code|MATCH ... OPTIONAL MATCH}}&nbsp;:
 
<pre>MATCH (p:Person)-[:LIVES_IN]->(c:City) OPTIONAL MATCH (p:Person)-[:NATIONAL_OF]->(ec:EUCountry)
RETURN p.first_name, p.last_name, c.name, c.state, ec.name</pre>
This query would return the city of residence of each person in the graph with residential information, and, if an EU national, which country they come from.
 
Queries are therefore able to first project a sub-graph of the graph input into the query, and then extract the data values associated with that subgraph. Data values can also be processed by functions, including aggregation functions, leading to the projection of computed values which render the information held in the projected graph in various ways. PatternsFollowing the lead of thisG-CORE kindand haveMorpheus, becomeGQL pervasiveaims into propertyproject graphthe querysub-graphs languages,defined andby arematching thepatterns basis(and forgraphs thethen advancedcomputed patternover those sub-languagegraphs) beingas definednew ingraphs SQL/PGQ,to whichbe isreturned likely to becomeby a subset ofquery. the GQL language.
Patterns of this kind have become pervasive in property graph query languages, and are the basis for the advanced pattern sub-language being defined in SQL/PGQ, which is likely to become a subset of the GQL language. Cypher also uses patterns for insertion and modification clauses (&nbsp;{{code|CREATE}}&nbsp; and &nbsp;{{code|MERGE}}&nbsp;), and proposals have been made in the GQL project for collecting node and edge patterns to describe graph types.
 
====Cypher implementations====
Line 58 ⟶ 75:
 
=== PGQL ===
PGQL<ref name="PGQL Grades">{{cite web|url=https://dl.acm.org/citation.cfm?id=2960421/|title=''PGQL: a property graph query language''. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems (GRADES '16). ACM, New York, NY, USA, Article 7, 6 pages. DOI: 10.1145/2960414.2960421|last=van Rest|first=Oskar|display-authors=etal|date=June 2016|website=|publisher=pgql.orgACM|accessdate=October 25, 2019}}</ref>
is a language designed and implemented by Oracle Inc., but made available as an open source specification<ref name="PGQL spec">{{cite web|url=http://pgql-lang.org/|title=PGQL|last=|first=|date=|website=|publisher=pgql.org|accessdate=October 6, 2019}}</ref>, along with JVM parsing software<ref name="PGQL parser">{{cite web|url=https://github.com/oracle/pgql-lang|title=''PGQL is an SQL-based query language for the Property Graph data model''. |last=van Rest|first=Oskar|display-authors=etal|date=September 2015|website=|publisher=pgql.org|accessdate=November 3, 2019}}</ref>. PGQL combines familiar SQL SELECT syntax including SQL expressions and result ordering and aggregation with a pattern matching language very similar to that of Cypher. It allows the specification of the graph to be queried, and includes a facility for macros to capture "pattern views", or named sub-patterns. It does not support insertion or updating operations, having been designed primarily for an analytics environment, such as Oracle's PGX product. PGQL has also been implemented in Oracle Big Data Spatial and Graph, and in a research project, PGX.D/Async<ref name="PGQL PGX.D">{{cite web|url=https://dl.acm.org/citation.cfm?doid=3078447.3078454/|title=''PGX.D/Async: A Scalable Distributed Graph Pattern Matching Engine''. In Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems (GRADES'17). ACM, New York, NY, USA, Article 7, 6 pages. DOI: 10.1145/3078447.3078454|last=Roth|first=Nicholas P.|display-authors=etal|date=2017|website=|publisher=ACM|accessdate=October 29, 2019}}</ref>.
 
=== G-CORE===
G-CORE is a research language designed by a group of academic and industrial researchers and language designers which draws on features of Cypher, PGQL and SPARQL<ref name="G-CORE">{{cite web|url=https://dl.acm.org/citation.cfm?id=3190654/|title=''G-CORE: A Core for Future Graph Query Languages. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). ACM, New York, NY, USA, 1421-1432. DOI: 10.1145/3183713.3190654|last=Angles|first=Renzo|display-authors=etal|date=2018|website=|publisher=ACM|accessdate=November 9, 2019}}</ref><ref name="G-CORE summary">{{cite web|url=https://dl.acm.org/citation.cfm?id=3190654/|title=''G-CORE: The LDBC Graph Query Language Proposal. In archives of FOSDEM 2018.|last=Voigt|first=Hannes|date=February 2018|website=|publisher=|accessdate=November 12, 2019}}</ref>. The project was conducted under the auspices of the Linked Data Benchmark Council (LDBC), starting with the formation of a Graph Query Language task force in late 2015, with the bulk of the work of paper writing occurring in 2017. G-CORE is a composable language which is closed over graphs: graph inputs are processed to create a graph output, using graph projections and graph set operations to construct the new graph. G-CORE queries are pure functions over graphs, having no side effects, which mean that the language does not define operations which mutate (update or delete) stored data. G-CORE introduces views (named queries). It also incorporates paths as elements in a graph ("paths as first class citizens"), which can be queried independently of projected paths (which are computed at query time over node and edge elements). G-CORE has been partially implemented in an open-source research projects in the LDBC Github organization<ref name="G-CORE parser">{{cite web|url=https://github.com/ldbc/ldbc_gcore_parser|title=''G-CORE Grammar and Parser''|last=van Rest|first=Oskar|date=2017|website=|publisher=LDBC|accessdate=November 12, 2019}}</ref><ref name="G-CORE project">{{cite web|url=https://homepages.cwi.nl/~boncz/msc/2018-GeorgianaCiocirdel.pdf|title=''A G-CORE (Graph Query Language) Interpreter'', Master’s Thesis in Parallel and Distributed Computer Systems, CWI and Vrije Universiteit Amsterdam|last=Ciocîrdel|first=Georgiana Diana|date=2018|website=|publisher=CWI|accessdate=November 12, 2019}}</ref><ref name="G-CORE interpreter">{{cite web|url=hhttps://github.com/ldbc/gcore-spark|title=''G-CORE interpreter on Spark''|last=Ciocîrdel|first=Georgiana Diana|last2=Boncz|first2=Peter|date=2017|website=|publisher=LDBC|accessdate=November 12, 2019}}</ref>.
 
=== GSQL ===
GSQL<ref name="GSQL white paper">{{cite web|url=https://info.tigergraph.com/gsql|title=''GSQL: An SQL-Inspired Graph Query Language''|last=Wu|first=Mingxi|last2=Deutsch|first2=Alin|date=|website=|publisher=|accessdate=November 9, 2019}}</ref> is a language designed for TigerGraph Inc.'s property graph database. Since October 2018 TigerGraph language designers have been very active in promoting and working on the GQL project. GSQL is a Turing-complete language that incorporates procedural flow control and iteration, and a facility for gathering and modifying computed values associated with a program execution for the whole graph or for elements of a graph called accumulators. These features are designed to enable iterative graph computations to be combined with data exploration and retrieval. GSQL graphs must be described by a schema of vertexes and edges, which constrains all insertions and updates. This schema therefore has the closed world property of an SQL schema, and this aspect of GSQL (also reflected in design proposals deriving from the Morpheus project<ref name="PGS">{{cite web|url=https://s3.amazonaws.com/artifacts.opencypher.org/website/materials/sql-pg-2018-0056r1-Property-Graph-Schema.pdf/|title=''Property Graph Schema'', ANSI INCITS DM32.2 SQL Property Graph Extensions Ad Hoc submission ''sql-pg-2018-0056r1'', Neo4j Query Languages Standards and Research Team|last=Voigt|first=Hannes|last2=Selmer|first2=Petra|last3=Lindaaker|first3=Tobias|last4=Plantikow|first4=Stefan|last5=Green|first5=Alastair|last6=Furniss|first6=Peter|date=December 2018|website=|publisher=openCypher.org|accessdate=November 12, 2019}}</ref>) is proposed as an important optional feature of GQL.

Vertexes and edges are are named schema objects which contain data but also define an imputed type, much as SQL tables are data containers, with an associated implicit row type. GSQL graphs are then composed from these vertex and edge sets, and multiple named graphs can include the same vertex or edge set. GSQL has developed new features since its release in September 2017<ref name="GSQL 1.0">{{cite web|url=https://doc-archive.tigergraph.com/1.0/GSQL-Language-Reference-Part-1---Defining-Graphs-and-Loading-Data.html|title=''GSQL documentation Tigergraph 1.0''.|date=2017|website=|publisher=|accessdate=November 9, 2019}}</ref>, most notably introducing variable-length edge pattern matching<ref name="GSQL patterns">{{cite web|url=https://docs.tigergraph.com/v/2.4/release-notes-change-log/release-notes-tigergraph-2.4|title=''Pattern Matching'', TigerGraph 2.4 Release Notes.|date=June 2019|website=|publisher=|accessdate=November 9, 2019}}</ref> using a syntax related to that seen in Cypher, PGQL, SQL/PGQ, but also close in style to the fixed-length patterns offered by Microsoft SQL/Server Graph<ref name="SQLServer Graph">{{cite web|url=https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview?view=sql-server-ver15#query-language-extensions|title=''Query language extensions'', Graph processing with SQL Server and Azure SQL Database|last=|first=|display-authors=etal|date=2017|website=|publisher=Microsoft Inc.|accessdate=November 10, 2019}}</ref>.
 
=== MultipleMorpheus: multiple graphs and composable graph queries in Cypher for Apache Spark ===
The opencypher Morpheus project<ref name="CAPS Morpheus"/> implements Cypher for Apache Spark users. Commencing in 2016, this project originally ran alongside twothree related efforts, in which Morpheus designers also took part: SQL/PGQ and, G-CORE and design of Cypher extensions for querying and constructing multiple graphs<ref name="MGCQ CIP">{{cite web|url=https://github.com/boggle/openCypher/blob/CIP2017-06-18-multiple-graphs/cip/1.accepted/CIP2017-06-18-multiple-graphs.adoc|title=''CIP2017-06-18 Querying and constructing multiple graphs''|last1=Taylor|first1=Andrés|last2=Plantikow|first2=Stefan|last3=Selmer|first3=Petra|date=2018|website=|publisher=opencypher.org|accessdate=November 12, 2019}}</ref>. The Morpheus project acted as a testbed for extensions to Cypher (known as "Cypher 10") in the two areas of graph DDL and query language extensions.
 
Graph DDL features include<ref name="CAPS oCIM V">{{cite web|url=https://s3.amazonaws.com/artifacts.opencypher.org/website/ocim5/slides/ocim5+-+CAPS+%2B+GraphDDL.pdf|title=''Multiple graphs and composable queries in Cypher for Apache Spark''. openCypher Implementers Meeting V, Berlin.|last=Kiessling|first=Max|date=2019|website=|publisher=opencypher.org|accessdate=November 9, 2019}}</ref>
Graph DDL features include
#definition of property graph views over JDBC-connected SQL tables and Spark DataFrames<ref name="CAPS oCIM V">{{cite web|url=https://s3.amazonaws.com/artifacts.opencypher.org/website/ocim5/slides/ocim5+-+CAPS+%2B+GraphDDL.pdf|title=''Multiple graphs and composable queries in Cypher for Apache Spark''. openCypher Implementers Meeting V, Berlin.|last=Kiessling|first=Max|date=2019|website=|publisher=opencypher.org|accessdate=November 9, 2019}}</ref><ref name="EXTENDS element type">{{cite web|url=https://github.com/tobias-johansson/graphddl-example-ldbc/|title=''graphddl-example-ldbc: A cypher-for-apache-spark example showing the use of SqlPropertyGraphSource and GraphDDL to provide a property graph view of a SQL dataset''.|last=Johanssen|first=Tobias|display-authors=etal|date=2019|website=|publisher=|accessdate=November 9, 2019}}</ref>
#definition of graph schemas or types defined by assembling node type and edge type patterns, with subtyping<ref name="EXTENDS element type"/>
#constraining the content of a graph by a closed or fixed schema
#creating catalog entries for multiple named graphs in a hierarchically-organized catalog<ref name="CAPS oCIM V"/>
#graph data sources to form a federated, heterogeneous catalog
#creating catalog entries for named queries (views)
Graph query language extensions include.<ref name="CAPS oCIM V"/>
#graph union
#projection of graphs computed from the results of pattern matches on multiple input graphs