Wikidata/Development/RDF

This is an archived version of this page, as edited by Denny Vrandečić (WMDE) at 20:32, 3 August 2012 (To do). It may differ significantly from the current version.

This note both motivates and specifies the RDF export of Wikidata.

General notes

The more familiar you are with the following topics, the easier it will be to understand this note. If you cannot follow a part, feel free to use the references given here.

Furthermore, throughout this note we will not use the URIs that will be used in the system later on, but rather suggestive and convenient URIs that help to follow the argumentation in this note. There exists a separate note on the Wikidata URI scheme. The namespaces used in this note are listed at the bottom of the page.

Motivation

Example

For the following discussion, we introduce two statements with the same item and property, the first with two, the second with one qualifier:

Berlin


Population: 3,499,879 [no sources]
    as of: November 30, 2011
    Method: Extrapolation
Population: 8,000 [1 source]
    as of: 15th century

Ground triples

The simplest possible way to represent the first statement would be the following triple:

w:Berlin p:Population "3499879"^^xsd:int .

We will call this triple the ground triple of the first example statement. The ground triple completely omits the qualifiers, which in case of the first example does not seem too bad. But the second statement is giving the population of Berlin as 8,000 in the 15th century. This would lead to the following ground triple:

w:Berlin p:Population "8000"^^xsd:int .

If this ground triple were published as well, Berlin would appear in the result set of a query for cities with fewer than 10,000 inhabitants, which is not desirable.
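The problem can be reproduced with a minimal sketch in Python. Triples are modeled as plain tuples; the helper name cities_below and the tuple layout are assumptions made for illustration and are not part of Wikidata.

```python
# Ground triples as (subject, predicate, integer value) tuples.
GROUND_TRIPLES = [
    ("w:Berlin", "p:Population", 3499879),
    ("w:Berlin", "p:Population", 8000),  # 15th-century figure, qualifier lost
]


def cities_below(triples, limit):
    """Return the subjects that have any population value below `limit`."""
    return sorted({s for s, p, v in triples if p == "p:Population" and v < limit})


# Berlin matches because of the historical figure alone.
print(cities_below(GROUND_TRIPLES, 10_000))
```

With only the first ground triple present, the same query returns no cities, which is the behavior a user would expect.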

Statements with qualifiers

Instead, we need to represent the statement including its qualifiers. Recommendations on how to represent this kind of data are given by the Semantic Web Best Practices note Defining N-ary Relations on the Semantic Web. In the following we follow its recommendation.

We have to introduce an intermediary node to represent the statement as such, and connect it to the item and the value. The following triples represent the claim from the first statement.

w:Berlin s:Population Berlin:Statement1 .
Berlin:Statement1 rdf:type o:Statement .
Berlin:Statement1 v:Population "3499879"^^xsd:int .
Berlin:Statement1 q:As_of "2011-11-30"^^xsd:date .
Berlin:Statement1 q:Method w:Extrapolation .
Berlin:Statement1 rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en .

Note that we introduced two new properties, s:Population and v:Population, instead of the original p:Population property (mind the namespaces). The original property, in the p: namespace, is a datatype property that connects items directly with an integer value (i.e. its domain is item). The s: property, on the other hand, is an object property connecting the item with the statement about it.

For the same reason we introduced the q: properties for qualifiers, connecting a statement node with the value of the qualifier.

A user of the data could easily get the ground triple, either by adding an OWL2 axiom stating that p:Population is derived from the property chain of s:Population and v:Population, or by using a SPARQL construct query with the same effect.
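As a rough illustration of that derivation, the following Python sketch follows each s: edge to its statement node and then the matching v: edge to the value. It stands in for the property chain axiom or the SPARQL construct query; the function name ground_triples and the tuple representation are assumptions, not part of the export.

```python
# Statement-level triples for the first example, as (subject, predicate, object).
TRIPLES = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "rdf:type", "o:Statement"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:Method", "w:Extrapolation"),
]


def ground_triples(triples):
    """Derive p: triples by following each s: edge and then the matching v: edge."""
    derived = []
    for item, s_prop, statement in triples:
        if not s_prop.startswith("s:"):
            continue
        local = s_prop[2:]  # local name shared by the s:, v: and p: properties
        for subj, v_prop, value in triples:
            if subj == statement and v_prop == "v:" + local:
                derived.append((item, "p:" + local, value))
    return derived


print(ground_triples(TRIPLES))
```

Qualifier triples in the q: namespace are ignored by the chain, so the derived triple carries only the main value.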

One additional feature, intended to support Semantic Web browsers, is to give the statement a label that corresponds to a human-readable serialization of the value including the qualifiers. The label would be exported in all available languages. This enables Semantic Web browsers to display the first triple in the above serialization in a way that is useful to the viewer of the data. We tested it with Marbles and the OpenLink Data Explorer.

In order to see how none and unknown are represented, please refer to the specification below.

Statements with references

Now that we have a node for the statement, it becomes trivial to add a reference.

Berlin:Statement1 o:Reference Berlin:Statement1-Reference1 .

We will investigate whether we can reuse an appropriate property from the provenance ontology here. We will also expand on how references are modeled as that work progresses.

Representation of values

Note that values may be, and usually are, further structured themselves. We will detail how the values are exported as we progress with specifying them further within Wikidata. Since values often have a unit or an accuracy associated with them, they will often be represented as yet another intermediate node with the respective values attached to it. For our examples we only deal with the datatype entity.

Specification

The specification is defined as a mapping of Wikidata statements written in the Wikidata Object Notation to OWL2 axioms written in Functional-style syntax. The transformation of OWL2 axioms to RDF triples is in turn defined by the OWL2 standard; the result is given here for convenience as well. The current specification is slightly simplified for now, as it omits PropertyIntervalSnak, PropertySomeIntervalSnak, PropertyInstanceOfSnak, PropertySubclassOfSnak, and Rank.

The following is a reiteration of the relevant part of the Wikidata Object Notation.

ItemDescription :=  'ItemDescription(' Item {Statement} ')'
Statement :=  'Statement(' MainSnak {Qualifier} {ReferenceRecord} ')'

Every statement is translated into a number of OWL2 axioms as described below. Every statement is identified by a StatementID, which is an IRI.

MainSnak a PropertyValueSnak

If the MainSnak is a PropertyValueSnak, then it is translated as follows:

ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ObjectPropertyAssertion( v:Property StatementID Value )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )

s:Property is an IRI that has the same local name as Property but replaces the p: namespace with the s: namespace. The same is defined for v:Property and q:Property respectively. The function ValueLabel returns an appropriate label describing the value of the statement including the qualifiers (not defined here).

In RDF this results in:

Item s:Property StatementID .
StatementID rdf:type o:Statement .
StatementID v:Property Value .
StatementID rdfs:label ValueLabel(Statement) .
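The mapping for this case can be sketched in a few lines of Python. The function name translate_value_snak, the tuple representation of triples, and the choice to pass the label in as a ready-made string (ValueLabel is not defined in this note) are all assumptions made for illustration.

```python
def translate_value_snak(item, prop, statement_id, value, label):
    """Return the four triples for a statement whose MainSnak is a PropertyValueSnak.

    `prop` is given in the p: namespace; the s: and v: properties reuse its
    local name.
    """
    local = prop.split(":", 1)[1]
    return [
        (item, "s:" + local, statement_id),
        (statement_id, "rdf:type", "o:Statement"),
        (statement_id, "v:" + local, value),
        (statement_id, "rdfs:label", label),
    ]


for triple in translate_value_snak(
    "w:Berlin", "p:Population", "Berlin:Statement1", 3499879, "3,499,879"
):
    print(triple)
```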

MainSnak a PropertySomeValueSnak

If the MainSnak is a PropertySomeValueSnak, then it is translated as follows:

ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ClassAssertion( ObjectSomeValuesFrom( v:Property owl:Thing ) StatementID )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )

In RDF this results in:

Item s:Property StatementID .
StatementID rdf:type o:Statement .
StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty v:Property .
_:1 owl:someValuesFrom owl:Thing .
StatementID rdfs:label ValueLabel(Statement) .

MainSnak a PropertyNoValueSnak

If the MainSnak is a PropertyNoValueSnak, then it is translated as follows:

ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ClassAssertion( ObjectAllValuesFrom( v:Property owl:Nothing ) StatementID )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )

In RDF this results in:

Item s:Property StatementID .
StatementID rdf:type o:Statement .
StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty v:Property .
_:1 owl:allValuesFrom owl:Nothing .
StatementID rdfs:label ValueLabel(Statement) .
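The PropertySomeValueSnak and PropertyNoValueSnak cases differ only in the restriction that types the statement node. A minimal Python sketch of that shared part follows; it produces only the four restriction triples, not the s: link, the o:Statement type, or the label. The names RESTRICTIONS and restriction_triples, and the fixed blank node label passed as a default argument, are assumptions made for illustration.

```python
# Restriction vocabulary for the two valueless snak kinds.
RESTRICTIONS = {
    "PropertySomeValueSnak": ("owl:someValuesFrom", "owl:Thing"),
    "PropertyNoValueSnak": ("owl:allValuesFrom", "owl:Nothing"),
}


def restriction_triples(statement_id, prop, snak_kind, blank="_:1"):
    """Return the class-restriction triples typing `statement_id` for a valueless snak."""
    predicate, filler = RESTRICTIONS[snak_kind]
    return [
        (statement_id, "rdf:type", blank),
        (blank, "rdf:type", "owl:Restriction"),
        (blank, "owl:onProperty", prop),
        (blank, predicate, filler),
    ]


for triple in restriction_triples("StatementID", "v:Property", "PropertyNoValueSnak"):
    print(triple)
```

A real exporter would need a fresh blank node per restriction; the fixed default here only mirrors the _:1 used in the listings.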

Qualifier

Each Qualifier is translated according to its kind. If the Qualifier is a PropertyValueSnak, then it is translated as follows:

ObjectPropertyAssertion( q:Property StatementID Value )

In RDF this results in:

StatementID q:Property Value .

If the Qualifier is a PropertySomeValueSnak, then it is translated as follows:

ClassAssertion( ObjectSomeValuesFrom( q:Property owl:Thing ) StatementID )

In RDF this results in:

StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty q:Property .
_:1 owl:someValuesFrom owl:Thing .

If the Qualifier is a PropertyNoValueSnak, then it is translated as follows:

ClassAssertion( ObjectAllValuesFrom( q:Property owl:Nothing ) StatementID )

In RDF this results in:

StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty q:Property .
_:1 owl:allValuesFrom owl:Nothing .

ReferenceRecord

Every ReferenceRecord is given a ReferenceID which is an IRI. Every ReferenceRecord is translated as follows:

ObjectPropertyAssertion( o:Reference StatementID ReferenceID )

In RDF this results in:

StatementID o:Reference ReferenceID .
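Qualifiers and references both attach further triples to the statement node. The following Python sketch covers qualifiers that are PropertyValueSnaks and plain reference IRIs; the function names, the (property, value) pair representation, and the use of p: names as input are assumptions made for illustration.

```python
def qualifier_triples(statement_id, qualifiers):
    """One q: triple per (property, value) qualifier, all assumed to be value snaks."""
    return [
        (statement_id, "q:" + prop.split(":", 1)[1], value)
        for prop, value in qualifiers
    ]


def reference_triples(statement_id, reference_ids):
    """One o:Reference triple per reference IRI."""
    return [(statement_id, "o:Reference", ref_id) for ref_id in reference_ids]


triples = qualifier_triples(
    "Berlin:Statement1",
    [("p:As_of", "2011-11-30"), ("p:Method", "w:Extrapolation")],
) + reference_triples("Berlin:Statement1", ["Berlin:Statement1-Reference1"])

for triple in triples:
    print(triple)
```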

Discussion of alternatives

The following discussion gives the rationale for the design decisions in this note. Readers who are not interested in it can skip it.

Punning for the property names

Instead of having different properties in the p:, s:, v: and q: namespaces, which turns every property in Wikidata into four in the RDF export, we could have used a single property in all four use cases. In almost all cases it would still be clear which role the property plays: p: connects the item with the value, s: the item with the statement, v: the statement with the value, and q: the statement with a qualifier value. The only ambiguity would be between v: and q:, as both connect the statement with a value. It arises if a qualifier uses the same property as the ground triple.
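The ambiguity can be made concrete with a small Python sketch. It assumes a hypothetical statement in which the same property is used both for the main value and as a qualifier; the values are invented for illustration, and pun is a hypothetical helper that collapses the v: and q: namespaces into p:.

```python
def pun(triples):
    """Collapse the v: and q: namespaces into a single p: namespace."""
    punned = []
    for s, p, o in triples:
        if p[:2] in ("v:", "q:"):
            p = "p:" + p[2:]
        punned.append((s, p, o))
    return punned


# Hypothetical statement: main value and a qualifier share the property Population.
separate = [
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:Population", 3400000),
]

punned = pun(separate)
# After punning both triples carry the same predicate, so the role of each
# value (main value or qualifier) can no longer be recovered from the triple.
print({p for _, p, _ in punned})
```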

Punning is generally frowned upon on the Semantic Web, but it is not unheard of: in OWL2, for example, individuals and classes can be punned. A recent proposal in the notorious httpRange-14 discussion also suggests punning as a solution.

We decided to not use punning in our case, but rather to accept the proliferation of properties. The reason for that is that it is not only best practice, but also necessary for the OWL2 DL serialization to be valid OWL as we need to make a clear difference between object and datatype properties.

We also wanted to ensure that the vocabulary developed within Wikidata will be reusable outside of Wikidata. As in most cases we expect external data publishers to use the direct representation of data with triples (i.e. just the ground triple), we also wanted to ensure that such a property is available for external reuse. Ironically this property is not the one we use in the Wikidata export itself, but there are well-defined relationships between them, as described above.

Named graphs

Adding a reference to a claim can be done in three ways:

  1. put every claim in one file and then add provenance metadata about that file
  2. put every claim in a named graph and then add provenance metadata about that graph
  3. reify every claim so that we can add provenance metadata directly to the claim

Ad 1: Having one file for every claim would lead to many files. Even with a conservative estimate of only about ten statements per item, resolving that item would require ten or more HTTP requests. This is prohibitively slow, especially considering that the requested files are all very small. It would also lead to an unreasonable amount of load on the server, something MediaWiki is not very well equipped for (a server based on Node.js or Twisted would probably be much better equipped for that, though).

Ad 2: The file holding all statements about an item could also contain a named graph for every single claim and then add metadata about these graphs, such as the references. Whereas this could be a solution, there is, as of the time of writing, no standard for the serialization of named graphs in files. There are a number of contenders (TriX, TriG, N-Quads, etc.), but none of them is even on the way to becoming a standard.

Ad 3: See the section on reification below.

Quads

Quads are a serialization for RDF that adds a fourth value to every triple: either a name (IRI) for the triple, or alternatively a context that groups triples into sets. Again, there is no standard serialization for quads, and no standard semantics either.
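The context reading of the fourth value can be sketched in Python as follows. The graph names Berlin:Claim1 and Berlin:Claim2 and the helper group_by_graph are hypothetical, and the sketch does not correspond to any particular quad serialization.

```python
from collections import defaultdict

# Each quad is (subject, predicate, object, graph name).
QUADS = [
    ("w:Berlin", "p:Population", 3499879, "Berlin:Claim1"),
    ("w:Berlin", "p:Population", 8000, "Berlin:Claim2"),
]


def group_by_graph(quads):
    """Group triples into sets keyed by the fourth element of each quad."""
    graphs = defaultdict(set)
    for s, p, o, g in quads:
        graphs[g].add((s, p, o))
    return dict(graphs)


for name, triples in group_by_graph(QUADS).items():
    print(name, triples)
```

Provenance metadata such as a reference could then be attached to the graph name rather than to an individual triple.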

Reification

Reification as per RDF standard is widely regarded as bloated and disliked. RDF introduces its own reification syntax, which has never really caught on. Due to its widely negative reputation, and due to discussions about deprecating reification from RDF, we decided against using this mechanism.

Publish the ground triple

One alternative concerns the publication of the ground triple. We decided not to publish it, in order to be more consistent throughout the RDF serialization of our data model. This avoids publishing a triple such as the one stating the population of Berlin as 8,000 in the example above.

One might suggest publishing the ground triple at least for unqualified statements, and omitting it only for qualified ones. Again we decided against it. First, we would need to represent the statement anyhow in order to publish the reference. Second, we would potentially publish conflicting ground triples in the same file if there are two different sources for two different statements. By always remaining on the level of statements, we stay consistent throughout the dataset and take the role of Wikidata as a secondary database seriously.

To do

  • Representation of labels, descriptions and sitelinks
  • Representation of data values
  • Representation of references
  • Representation of rank
  • Representation of the following Snaks: InstanceOf, SubclassOf, PropertyInterval, and PropertySomeInterval

Namespaces

  • w: for Wikidata items
  • o: for the Wikidata ontology (a fixed and small set of terms)
  • p: for Wikidata properties
  • q: for properties used as qualifiers
  • s: for properties used to connect items and statements
  • v: for properties used to connect statements and values
  • Berlin: used as a shortcut for w:Berlin but defined as a prefix
  • xsd:, rdf:, rdfs:, and owl: with their usual meanings