Wikidata/Development/RDF

This note both motivates and specifies the RDF export of Wikidata.

General notes

The more familiar you are with the following topics, the easier it will be to understand this note. If you cannot follow a part, feel free to use the references given here.

RDF as the basic export model
OWL for some constructs
Turtle as a syntax
Linked data principles
Wikidata data model or at least the Wikidata data model primer

Throughout this note we do not use the URIs that the system will eventually use, but rather suggestive and convenient URIs that make the argument easier to follow. There is a separate note on the Wikidata URI scheme. The namespaces used in this note are listed at the bottom of the page.

Motivation

Example

For the following discussion, we introduce two statements with the same item and property, the first with two, the second with one qualifier:

Berlin

Population	3,499,879	[no sources]
	as of November 30, 2011
	Method Extrapolation

	8,000	[1 source]
	as of 15th century

Ground triples

The simplest possible way to represent the first statement would be the following triple:

w:Berlin p:Population "3499879"^^xsd:int .

We will call this triple the ground triple of the first example statement. The ground triple completely omits the qualifiers, which does not seem too bad for the first statement. But the second statement gives the population of Berlin as 8,000 in the 15th century. This would lead to the following ground triple:

w:Berlin p:Population "8000"^^xsd:int .

If this ground triple were given too, Berlin would be in the result set when querying for cities with fewer than 10,000 inhabitants, which is not desirable.
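To make the problem concrete, here is a minimal Python sketch that models the two ground triples as plain tuples and runs the query for cities with fewer than 10,000 inhabitants. The tuple representation and the function name `cities_below` are assumptions made for illustration; they are not part of any Wikidata tooling.

```python
# Ground triples modeled as plain (subject, predicate, object) tuples.
# This representation is an illustrative assumption, not an export format.
ground_triples = [
    ("w:Berlin", "p:Population", 3499879),  # first statement, qualifiers dropped
    ("w:Berlin", "p:Population", 8000),     # second statement, qualifiers dropped
]


def cities_below(triples, limit):
    """Return subjects with at least one p:Population value below limit."""
    return sorted({s for s, p, o in triples if p == "p:Population" and o < limit})


# Berlin matches because the 15th-century figure lost its "as of" qualifier.
print(cities_below(ground_triples, 10000))
```

With only the first ground triple present the result is empty; adding the second one is what pulls Berlin into the result.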

Statements with qualifiers

Instead, we need to represent the statement including the qualifiers. Recommendations on how to represent this kind of data are given by the Semantic Web Best Practices note on Defining N-ary Relations on the Semantic Web. In the following we follow their recommendation.

We have to introduce an intermediary node to represent the statement as such, and connect it to the item and the value. The following triples represent the claim from the first statement.

w:Berlin s:Population Berlin:Statement1 .
Berlin:Statement1 rdf:type o:Statement .
Berlin:Statement1 v:Population "3499879"^^xsd:int .
Berlin:Statement1 q:As_of "2011-11-30"^^xsd:date .
Berlin:Statement1 q:Method w:Extrapolation .
Berlin:Statement1 rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en .

Note that we introduced two new properties, s:Population and v:Population, instead of the original p:Population property (mind the namespaces). The original property, in the p: namespace, is a datatype property that connects items directly with an integer value (i.e. the domain is item). The s: property, on the other hand, is an object property connecting the item with the statement about it.

For the same reason we introduced the q: properties for qualifiers, connecting a statement node with the value of the qualifier.
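The pattern above can be sketched in Python as a small helper that emits the triples for one statement node. The function name `statement_triples`, its parameters, and the tuple representation are hypothetical and only mirror the example triples in this note.

```python
def statement_triples(item, prop, statement_id, value, qualifiers):
    """Build the triples for one statement node (illustrative sketch).

    item, prop: for example "w:Berlin" and "Population"
    statement_id: name of the intermediary node, e.g. "Berlin:Statement1"
    qualifiers: list of (qualifier property name, value) pairs
    """
    triples = [
        (item, "s:" + prop, statement_id),          # item -> statement node
        (statement_id, "rdf:type", "o:Statement"),  # type the node
        (statement_id, "v:" + prop, value),         # statement node -> value
    ]
    for q_prop, q_value in qualifiers:
        triples.append((statement_id, "q:" + q_prop, q_value))
    return triples


first = statement_triples(
    "w:Berlin", "Population", "Berlin:Statement1", 3499879,
    # xsd:date values use the ISO form YYYY-MM-DD.
    [("As_of", "2011-11-30"), ("Method", "w:Extrapolation")],
)
for triple in first:
    print(triple)
```

The sketch leaves out the label triple and treats every value as a plain Python value; a real export would also have to attach datatypes.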

A user of the data could easily get the ground triple, either by adding an OWL2 axiom stating that p:Population is derived from the property chain of s:Population and v:Population, or by using a SPARQL construct query with the same effect.
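As a rough illustration of what the property chain achieves, the following Python sketch joins s: and v: triples over the statement node to recover ground triples. It is a plain-Python emulation under the assumption that triples are tuples with prefixed names as strings; it is neither OWL reasoning nor SPARQL, and `derive_ground_triples` is a hypothetical name.

```python
def derive_ground_triples(triples):
    """Emulate the chain: item -s:P-> statement -v:P-> value gives item -p:P-> value."""
    # Index the values by (statement node, property name without prefix).
    values = {}
    for s, p, o in triples:
        if p.startswith("v:"):
            values.setdefault((s, p[2:]), []).append(o)

    ground = []
    for s, p, o in triples:
        if p.startswith("s:"):
            # o is the statement node; look up its value for the same property.
            for value in values.get((o, p[2:]), []):
                ground.append((s, "p:" + p[2:], value))
    return ground


exported = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "rdf:type", "o:Statement"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:Method", "w:Extrapolation"),
]
print(derive_ground_triples(exported))
```

The qualifier triples are ignored by the join, which is exactly why the derived ground triple carries less information than the statement node.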

One additional feature to support Semantic Web browsers is to give the statement a label that corresponds to a human-readable serialization of the value including the qualifiers. The label would be exported in all available languages. This enables Semantic Web browsers to display the first triple in the above serialization in a way that is useful for the viewer of the data. We tested it with Marbles and the OpenLink Data Explorer.
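A label of this kind could be assembled roughly as follows. This is only a sketch for the English label: the function name `statement_label` is hypothetical, the qualifier values are assumed to be already formatted for display, and localization into other languages is not covered.

```python
def statement_label(value, qualifiers):
    """Render an integer value and its qualifiers as one display string."""
    text = f"{value:,}"  # thousands separators, e.g. 3,499,879
    if qualifiers:
        parts = [f"{name} {shown}" for name, shown in qualifiers]
        text += " (" + ", ".join(parts) + ")"
    return text


print(statement_label(
    3499879,
    [("As of", "Nov 30, 2011"), ("Method", "Extrapolation")],
))
# prints: 3,499,879 (As of Nov 30, 2011, Method Extrapolation)
```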

To see how the special values none and unknown are represented, please refer to the specification below.

Statements with references

Now that we have a node for the statement, it becomes trivial to add a reference.

Berlin:Statement1 o:Reference Berlin:Statement1-Reference1 .

We will investigate whether we can reuse an appropriate property from the provenance ontology here. We will also expand on how references are modeled once that work has progressed further.

Representation of values

Note that the values can be, and usually are, further structured themselves. We will detail how the values are exported as we progress with specifying them further within Wikidata. Since values often have a unit or an accuracy associated with them, they will often be represented as yet another intermediate node with the respective values attached to it. For our examples we only deal with the datatype entity.

Specification

The specification is defined as a mapping of Wikidata statements written in the Wikidata Object Notation to OWL2 axioms written in Functional-style syntax. The transformation of OWL2 axioms to RDF triples is defined by the OWL2 standard (and for convenience, the result is given here). Here are the grammar rules for item descriptions:

ItemDescription := 'ItemDescription(' Item Statement* ')'

Discussion of alternatives

The following discussion gives the rationale for the design decisions in this note and can be skipped by readers who are not interested in it.

Punning for the property names

Instead of having different properties in the p:, s:, v: and q: namespaces, turning every property in Wikidata into four in the RDF export, we could have used a single property for all four use cases. In almost all cases it would still be clear which property is actually meant: p: connects the item with the value, s: the item with the statement, v: the statement with the value, and q: is used for qualifiers, connecting the statement with the qualifier value. The only case of ambiguity would be between v: and q:, as they both connect the statement with a value; the ambiguity arises if a qualifier uses the same property as the ground triple.
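The v:/q: ambiguity can be illustrated with a small Python sketch. The statement node `ex:Statement`, the single punned namespace `x:`, and the values 100 and 200 are purely hypothetical placeholders chosen for illustration.

```python
def roles_on_node(triples, node):
    """Split the values on a statement node by role, using the v:/q: prefixes."""
    roles = {"value": [], "qualifier": []}
    for s, p, o in triples:
        if s != node:
            continue
        if p.startswith("v:"):
            roles["value"].append(o)
        elif p.startswith("q:"):
            roles["qualifier"].append(o)
    return roles


# Hypothetical statement whose qualifier uses the same property as its value.
separate = [
    ("ex:Statement", "v:Population", 100),
    ("ex:Statement", "q:Population", 200),
]

# Punned variant: both roles collapse onto one property in an assumed x: namespace.
punned = [(s, "x:" + p[2:], o) for s, p, o in separate]
punned_predicates = {p for _, p, _ in punned}

print(roles_on_node(separate, "ex:Statement"))  # roles stay distinguishable
print(punned_predicates)                        # only one predicate is left
```

With separate namespaces the two values keep their roles; in the punned variant both triples share one predicate, so nothing in the data says which value is the main value and which is the qualifier.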

Punning is in general frowned upon on the Semantic Web, but it is not unheard of, e.g. in OWL2 individuals and classes can be punned in general. Also, a recent proposal in the notorious httpRange-14 discussion suggests punning as a solution.

We decided to not use punning in our case, but rather to accept the proliferation of properties. The reason for that is that it is not only best practice, but also necessary for the OWL2 DL serialization to be valid OWL as we need to make a clear difference between object and datatype properties.

We also wanted to ensure that the vocabulary developed within Wikidata will be reusable outside of Wikidata. As in most cases we expect external data publishers to use the direct representation of data with triples, i.e. just the ground triple, we also wanted to ensure that such a property is available for external reuse. Ironically, this property is not the one we use in the Wikidata export itself, but there are well-defined relationships between them, as described above.

Named graphs

Giving a claim a reference can be done in three ways:

put every claim in one file and then add provenance metadata about that file
put every claim in a named graph and then add provenance metadata about that graph
reify every claim so that we can add provenance metadata directly to the claim

Quads

There is no standard for quads, and they do not save that much.

Reification

No one likes it.

Publish the ground triple

No.

To do

Representation of data values
Representation of references
Representation of the following Snaks: InstanceOf, SubclassOf, PropertyInterval, and PropertySomeInterval

Namespaces

w: for Wikidata items
o: for the Wikidata ontology (a fixed and small set of terms)
p: for Wikidata properties
q: for properties used as qualifiers
s: for properties used to connect items and statements
v: for properties used to connect statements and values
Berlin: used as a shortcut for w:Berlin but defined as a prefix
xsd:, rdf:, rdfs:, and owl: with their usual meanings