Wikidata/Development/RDF
This note both motivates and specifies the RDF export of Wikidata.
General notes
The more familiar you are with the following topics, the easier it will be to understand this note. If you cannot follow a part, feel free to consult the references given here.
- RDF as the basic export model
- OWL for some constructs
- Turtle as a syntax
- Linked data principles
- Wikidata data model or at least the Wikidata data model primer
Furthermore, throughout this note we do not use the URIs that the system will eventually use, but rather suggestive and convenient URIs that make the argument easier to follow. There exists a separate note on the Wikidata URI scheme. The namespaces used in this note are listed at the bottom of the page.
Motivation
Example
For the following discussion, we introduce two statements with the same item and property, the first with two, the second with one qualifier:
Berlin
Property | Value | Qualifiers | Sources
Population | 3,499,879 | as of November 30, 2011; Method: Extrapolation | [no sources]
Population | 8,000 | as of 15th century | [1 source]
Ground triples
The simplest possible way to represent the first statement would be the following triple:
w:Berlin p:Population "3499879"^^xsd:int .
We will call this triple the ground triple of the first example statement. The ground triple completely omits the qualifiers, which in the case of the first example does not seem too bad. But the second statement gives the population of Berlin as 8,000 in the 15th century. This would lead to the following ground triple:
w:Berlin p:Population "8000"^^xsd:int .
If this ground triple were given too, Berlin would be in the result set when querying for cities with fewer than 10,000 inhabitants, which is clearly not desirable.
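The problem can be reproduced with a minimal sketch in Python that models triples as plain tuples. The item w:Tiny_Village and its value are made up for contrast and are not part of this note; the helper name is likewise only illustrative.

```python
# Triples modelled as (subject, predicate, object) tuples.
# w:Tiny_Village is a hypothetical item added only for contrast.
ground_triples = [
    ("w:Berlin", "p:Population", 3499879),
    ("w:Berlin", "p:Population", 8000),  # 15th-century value, qualifier lost
    ("w:Tiny_Village", "p:Population", 950),
]


def items_with_population_below(triples, limit):
    """Return every subject with at least one population value below limit."""
    return sorted(
        {s for s, p, o in triples if p == "p:Population" and o < limit}
    )


small_places = items_with_population_below(ground_triples, 10000)
print(small_places)
```

Because the second ground triple carries no hint that it describes the 15th century, Berlin ends up next to the genuinely small place in the result.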
Statements with qualifiers
Instead, we need to represent the statement including the qualifiers. Recommendations on how to represent this kind of data are given by the Semantic Web Best Practice note on Defining N-ary Relations on the Semantic Web. In the following we follow their recommendation.
We have to introduce an intermediary node to represent the statement as such, and connect it to the item and the value. The following triples represent the claim from the first statement.
w:Berlin s:Population Berlin:Statement1 .
Berlin:Statement1 rdf:type o:Statement .
Berlin:Statement1 v:Population "3499879"^^xsd:int .
Berlin:Statement1 q:As_of "2011-11-30"^^xsd:date .
Berlin:Statement1 q:Method w:Extrapolation .
Berlin:Statement1 rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en .
Note that we introduced two new properties, s:Population and v:Population, instead of the original p:Population property (mind the namespaces). The original property, in the p: namespace, is a datatype property that connects items directly with an integer value (i.e. its domain is item). The s: property, on the other hand, is an object property connecting the item with the statement about it, and the v: property connects the statement with its value.
For the same reason we introduced the q: properties for qualifiers, connecting a statement node with the value of the qualifier.
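To illustrate what the statement node buys us, here is a small Python sketch that selects population values by their q:As_of qualifier. It assumes triples are modelled as tuples and dates as ISO strings; Berlin:Statement2 and the stand-in date "1400" for "15th century" are assumptions made only for this example.

```python
# Statement-level triples for both example statements.
# Berlin:Statement2 and the date "1400" are illustrative assumptions.
triples = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:As_of", "2011-11-30"),
    ("Berlin:Statement1", "q:Method", "w:Extrapolation"),
    ("w:Berlin", "s:Population", "Berlin:Statement2"),
    ("Berlin:Statement2", "v:Population", 8000),
    ("Berlin:Statement2", "q:As_of", "1400"),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def population_as_of(triples, item, not_before):
    """Population values whose q:As_of qualifier is not earlier than not_before.

    ISO date strings with four-digit years compare correctly as plain strings.
    """
    values = []
    for statement in objects(triples, item, "s:Population"):
        dates = objects(triples, statement, "q:As_of")
        if any(date >= not_before for date in dates):
            values.extend(objects(triples, statement, "v:Population"))
    return values


recent = population_as_of(triples, "w:Berlin", "2000")
print(recent)
```

A query for small cities can now restrict itself to recent statements instead of picking up the historical value.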
A user of the data could easily get the ground triple, either by adding an OWL2 axiom stating that p:Population is derived from the property chain of s:Population and v:Population, or by using a SPARQL construct query with the same effect.
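The effect of such a property chain can be sketched in plain Python. This is not OWL reasoning or SPARQL, only an emulation of the join on tuples, and the function name is an assumption made for illustration.

```python
def derive_ground_triples(triples):
    """Emulate the chain s:X followed by v:X, yielding p:X ground triples."""
    ground = []
    for item, s_prop, statement in triples:
        if not s_prop.startswith("s:"):
            continue
        local_name = s_prop[len("s:"):]
        # Follow the matching v: property from the statement node to the value.
        for subject, v_prop, value in triples:
            if subject == statement and v_prop == "v:" + local_name:
                ground.append((item, "p:" + local_name, value))
    return ground


statement_triples = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "rdf:type", "o:Statement"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:Method", "w:Extrapolation"),
]
derived = derive_ground_triples(statement_triples)
print(derived)
```

A consumer who derives ground triples this way for every statement reintroduces the problem described under "Ground triples", so in practice one would probably filter the statements by qualifier first.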
One additional feature, in order to support Semantic Web browsers, is to give the statement a label that corresponds to a human-readable serialization of the value including the qualifiers. The label would be exported in all available languages. This enables Semantic Web browsers to display the first triple in the above serialization in a way that is useful to the viewer of the data. We tested it with Marbles and the OpenLink Data Explorer.
To see how none (PropertyNoValueSnak) and unknown (PropertySomeValueSnak) values are represented, please refer to the specification below.
Statements with references
Now that we have a node for the statement, it becomes trivial to add a reference.
Berlin:Statement1 o:Reference Berlin:Statement1-Reference1 .
We will investigate whether we can reuse an appropriate property from the provenance ontology here. We will also expand on how references are modeled as that work progresses.
Representation of values
Note that the values may be, and usually are, further structured themselves. We will detail how values are exported as we specify them further within Wikidata. Since values often have a unit or an accuracy associated with them, they will often be represented as yet another intermediate node with the respective values attached to it. For our examples we only deal with the datatype entity.
Specification
The specification is defined as a mapping of Wikidata statements written in the Wikidata Object Notation to OWL2 axioms written in Functional-style syntax. The transformation of OWL2 axioms to RDF triples is in turn defined by the OWL2 standard; the result is given here for convenience as well. The current specification is slightly simplified, as it omits PropertyIntervalSnak, PropertySomeIntervalSnak, PropertyInstanceOfSnak, PropertySubclassOfSnak, and Rank.
The following is a reiteration of the relevant part of the Wikidata Object Notation.
ItemDescription := 'ItemDescription(' Item {Statement} ')'
Statement := 'Statement(' MainSnak {Qualifier} {ReferenceRecord} ')'
Every statement is translated into a number of OWL2 axioms as described below. Every statement is identified by a StatementID, which is an IRI.
MainSnak a PropertyValueSnak
If the MainSnak is a PropertyValueSnak, then it is translated as follows:
ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ObjectPropertyAssertion( v:Property StatementID Value )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )
s:Property is an IRI that has the same local name as Property but replaces the p: namespace with the s: namespace. The same holds for v:Property and q:Property respectively. The function ValueLabel returns an appropriate label describing the value of the statement including the qualifiers (not defined here).
In RDF this results in:
Item s:Property StatementID .
StatementID rdf:type o:Statement .
StatementID v:Property Value .
StatementID rdfs:label ValueLabel(Statement) .
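The mapping above can be sketched as a small Python function that produces the four triples as tuples. The function names are assumptions made for illustration. Since ValueLabel is not defined in this note, the label is simply passed in by the caller.

```python
def swap_namespace(prop, prefix):
    """Replace the p: prefix of a property name with another prefix (s, v or q)."""
    if not prop.startswith("p:"):
        raise ValueError("expected a property in the p: namespace: " + prop)
    return prefix + ":" + prop[len("p:"):]


def translate_value_snak(item, prop, statement_id, value, label):
    """Triples for a statement whose MainSnak is a PropertyValueSnak."""
    return [
        (item, swap_namespace(prop, "s"), statement_id),
        (statement_id, "rdf:type", "o:Statement"),
        (statement_id, swap_namespace(prop, "v"), value),
        (statement_id, "rdfs:label", label),
    ]


result = translate_value_snak(
    "w:Berlin", "p:Population", "Berlin:Statement1", 3499879, "3,499,879"
)
for triple in result:
    print(triple)
```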
MainSnak a PropertySomeValueSnak
If the MainSnak is a PropertySomeValueSnak, then it is translated as follows:
ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ClassAssertion( ObjectSomeValuesFrom( v:Property owl:Thing ) StatementID )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )
In RDF this results in:
Item s:Property StatementID .
StatementID rdf:type o:Statement .
StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty v:Property .
_:1 owl:someValuesFrom owl:Thing .
StatementID rdfs:label ValueLabel(Statement) .
MainSnak a PropertyNoValueSnak
If the MainSnak is a PropertyNoValueSnak, then it is translated as follows:
ObjectPropertyAssertion( s:Property Item StatementID )
ClassAssertion( o:Statement StatementID )
ClassAssertion( ObjectAllValuesFrom( v:Property owl:Nothing ) StatementID )
AnnotationAssertion( rdfs:label StatementID ValueLabel(Statement) )
In RDF this results in:
Item s:Property StatementID .
StatementID rdf:type o:Statement .
StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty v:Property .
_:1 owl:allValuesFrom owl:Nothing .
StatementID rdfs:label ValueLabel(Statement) .
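The restriction triples for the unknown-value and no-value cases differ only in the quantifier and its filler, which the following Python sketch makes explicit. The snak kind names "somevalue" and "novalue", the helper names, and the blank node label scheme are assumptions made for illustration. The sketch also generates a fresh blank node per restriction, since several restrictions exported in one RDF document must not share the label _:1.

```python
import itertools

# Counter used to mint distinct blank node labels within one export.
_blank_ids = itertools.count(1)


def fresh_blank_node():
    """Return a blank node label that has not been handed out before."""
    return "_:b%d" % next(_blank_ids)


def restriction_triples(statement_id, v_prop, snak_kind):
    """Triples typing a statement node with an OWL restriction on v_prop."""
    if snak_kind == "somevalue":
        quantifier, filler = "owl:someValuesFrom", "owl:Thing"
    elif snak_kind == "novalue":
        quantifier, filler = "owl:allValuesFrom", "owl:Nothing"
    else:
        raise ValueError("unsupported snak kind: " + snak_kind)
    node = fresh_blank_node()
    return [
        (statement_id, "rdf:type", node),
        (node, "rdf:type", "owl:Restriction"),
        (node, "owl:onProperty", v_prop),
        (node, quantifier, filler),
    ]


for triple in restriction_triples("Berlin:Statement1", "v:Population", "novalue"):
    print(triple)
```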
Qualifier
Each Qualifier is translated according to its kind. If the Qualifier is a PropertyValueSnak, then it is translated as follows:
ObjectPropertyAssertion( q:Property StatementID Value )
In RDF this results in:
StatementID q:Property Value .
If the Qualifier is a PropertySomeValueSnak, then it is translated as follows:
ClassAssertion( ObjectSomeValuesFrom( q:Property owl:Thing ) StatementID )
In RDF this results in:
StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty q:Property .
_:1 owl:someValuesFrom owl:Thing .
If the Qualifier is a PropertyNoValueSnak, then it is translated as follows:
ClassAssertion( ObjectAllValuesFrom( q:Property owl:Nothing ) StatementID )
In RDF this results in:
StatementID rdf:type _:1 .
_:1 rdf:type owl:Restriction .
_:1 owl:onProperty q:Property .
_:1 owl:allValuesFrom owl:Nothing .
ReferenceRecord
Every ReferenceRecord is given a ReferenceID, which is an IRI. Every ReferenceRecord is translated as follows:
ObjectPropertyAssertion( o:Reference StatementID ReferenceID )
In RDF this results in:
StatementID o:Reference ReferenceID .
Discussion of alternatives
The following discussion gives the rationale for the design decisions in this note and can be skipped by readers who are not interested in it.
Punning for the property names
Instead of having different properties in the p:, s:, v: and q: namespaces, turning every property in Wikidata into four in the RDF export, we could have used a single property in all four roles. In almost all cases it would still be clear which role is meant: p: connects the item with the value, s: the item with the statement, v: the statement with the value, and q: is used for qualifiers, connecting the statement with the qualifier value. The only ambiguity would be between v: and q:, as both connect the statement with a value: if a qualifier uses the same property as the main value of the statement, the two can no longer be told apart.
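The ambiguity can be made concrete with a short Python sketch of a hypothetical punned export in which a single property x:Population plays every role. The x: prefix, the second value 3400000, and the function name are all made up for illustration.

```python
# Hypothetical punned export: one property x:Population plays every role.
ITEMS = {"w:Berlin"}
STATEMENTS = {"Berlin:Statement1"}


def possible_roles(triple):
    """Guess which of the p/s/v/q roles a punned property plays in a triple."""
    subject, _, obj = triple
    if subject in ITEMS and obj in STATEMENTS:
        return {"s"}  # item to statement
    if subject in ITEMS:
        return {"p"}  # item to value (ground triple)
    if subject in STATEMENTS:
        return {"v", "q"}  # statement to value: main value or qualifier?
    return set()


punned = [
    ("w:Berlin", "x:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "x:Population", 3499879),
    ("Berlin:Statement1", "x:Population", 3400000),  # made-up qualifier value
]
for triple in punned:
    print(triple, sorted(possible_roles(triple)))
```

The first triple can be classified from the kinds of its subject and object alone, but for the last two nothing in the data says which one is the main value and which one is the qualifier.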
Punning is in general frowned upon on the Semantic Web, but it is not unheard of; e.g. in OWL2, individuals and classes can be punned. Also, a recent proposal in the notorious httpRange-14 discussion suggests punning as a solution.
We decided not to use punning in our case, but rather to accept the proliferation of properties. The reason is that this is not only best practice, but also necessary for the OWL2 DL serialization to be valid OWL, as we need to distinguish clearly between object and datatype properties.
We also wanted to ensure that the vocabulary developed within Wikidata will be reusable outside of Wikidata. As we expect external data publishers in most cases to use the direct representation of data with triples (i.e. just the ground triple), we also wanted to ensure that such a property is available for external reuse. Ironically, this property is not the one we use in the Wikidata export itself, but there are well-defined relationships between them, as described above.
Named graphs
Giving a claim a reference can be done in three ways:
- put every claim in one file and then add provenance metadata about that file
- put every claim in a named graph and then add provenance metadata about that graph
- reify every claim so that we can add provenance metadata directly to the claim
Quads
No standard. And doesn't save that much.
Reification
No one likes it.
Publish the ground triple
No.
To do
- Representation of data values
- Representation of references
- Representation of rank
- Representation of the following Snaks: InstanceOf, SubclassOf, PropertyInterval, and PropertySomeInterval
Namespaces
- w: for Wikidata items
- o: for the Wikidata ontology (a fixed and small set of terms)
- p: for Wikidata properties
- q: for properties used as qualifiers
- s: for properties used to connect items and statements
- v: for properties used to connect statements and values
- Berlin: used as a shortcut for w:Berlin but defined as a prefix
- xsd:, rdf:, rdfs:, and owl: with their usual meanings