Wikidata/Development/RDF

{{MovedToMediaWiki|Wikibase/Indexing/RDF_Dump_Format}}
This note both motivates and specifies the RDF export of Wikidata.
 
== General notes ==
The more familiar you are with the following topics, the easier it will be to understand this note. If you cannot follow a part, feel free to consult the references given here.
* [[:en:Resource Description Framework|RDF]] as the basic export model
* [[:en:Web Ontology Language|OWL]] for some constructs
* [[:en:Turtle (syntax)|Turtle]] as a syntax
* [[:en:Linked data|Linked data principles]]
* [[Wikidata/Data model|Wikidata data model]] or at least the [[Wikidata/Notes/Data model primer|Wikidata data model primer]]
 
Furthermore, throughout this note we do not use the URIs that the system will use later on, but rather suggestive and convenient URIs that make the argument easier to follow. There is a separate [[Wikidata/Notes/URI scheme|note on the Wikidata URI scheme]]. The namespaces used in this note are listed at the [[#Namespaces|bottom of the page]].
 
== Motivation ==
=== Example ===
For the following discussion, we introduce two statements with the same item and property, the first with two, the second with one qualifier:
 
<div style="padding: 2ex; border:#444 solid 1px; ">
{{Wikidata statement|item=Berlin|property=Population|value=3,499,879|qualifier1=as of|value1=November 30, 2011|qualifier2=Method|value2=Extrapolation}}
{{Wikidata statement|value=8,000|qualifier1=as of|value1=15th century|numberofsources=1}}
</div>
 
=== Ground triples ===
The simplest possible way to represent the first statement would be the following triple:
 w:Berlin p:Population "3499879"^^xsd:int .
We will call this triple the ground triple of the first example statement. The ground triple completely omits the qualifiers, which in the case of the first example does not seem too bad. But the second statement gives the population of Berlin as 8,000 in the 15th century. This would lead to the following ground triple:
 w:Berlin p:Population "8000"^^xsd:int .
If this ground triple were published too, Berlin would be in the result set of a query for cities with fewer than 10,000 inhabitants, which is not desirable.
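
The problem can be illustrated with a small sketch. The following Python snippet is not part of the export. It models the two ground triples as plain tuples, and the helper <code>items_with_population_below</code> is a hypothetical stand-in for a query engine.

```python
# Ground triples as (subject, predicate, value) tuples; the qualifiers are lost.
ground_triples = [
    ("w:Berlin", "p:Population", 3499879),  # the recent figure
    ("w:Berlin", "p:Population", 8000),     # the 15th-century figure
]


def items_with_population_below(triples, limit):
    """Return items that have at least one population value below limit."""
    return sorted({s for (s, p, v) in triples if p == "p:Population" and v < limit})


# Berlin matches because of the historical value alone.
print(items_with_population_below(ground_triples, 10000))  # ['w:Berlin']
```

Without the qualifier, nothing in the data tells the query that the second value is historical.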
 
=== Statements with qualifiers ===
Instead, we need to represent the statement including its qualifiers. Recommendations on how to represent this kind of data are given in the Semantic Web Best Practices note on [http://www.w3.org/TR/swbp-n-aryRelations/ Defining N-ary Relations on the Semantic Web]. In the following we follow its recommendations.
 
We have to introduce an intermediary node to represent the statement as such, and connect it to the item and the value. The following triples represent the claim from the first statement.
 
 w:Berlin s:Population Berlin:Statement1 .
 Berlin:Statement1 rdf:type o:Statement .
 Berlin:Statement1 v:Population "3499879"^^xsd:int .
 Berlin:Statement1 q:As_of "2011-11-30"^^xsd:date .
 Berlin:Statement1 q:Method w:Extrapolation .
 Berlin:Statement1 rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en .
 
Note that we introduced two new properties, <code>s:Population</code> and <code>v:Population</code>, instead of the original <code>p:Population</code> property (mind the namespaces). The original property, in the <code>p:</code> namespace, is a datatype property that connects items directly with an integer value (i.e. the domain is item). The <code>s:</code> property, on the other hand, is an object property connecting the item with the statement about it, and the <code>v:</code> property connects that statement with its value.
 
For the same reason we introduced the <code>q:</code> properties for qualifiers, connecting a statement node with the value of the qualifier.
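
As a rough illustration of this pattern, the following Python sketch assembles the triples for one statement node. The function <code>statement_triples</code> and the tuple representation are assumptions made for this example only and do not describe the actual exporter. The label triple is left out for brevity, and the date is written in ISO form.

```python
def statement_triples(item, prop, statement, value, qualifiers):
    """Return the triples for one statement node (an illustrative sketch)."""
    triples = [
        (item, "s:" + prop, statement),          # item -> statement node
        (statement, "rdf:type", "o:Statement"),  # type of the intermediary node
        (statement, "v:" + prop, value),         # statement node -> main value
    ]
    for qualifier, qualifier_value in qualifiers:
        # each qualifier hangs off the statement node via a q: property
        triples.append((statement, "q:" + qualifier, qualifier_value))
    return triples


berlin = statement_triples(
    "w:Berlin", "Population", "Berlin:Statement1", 3499879,
    [("As_of", "2011-11-30"), ("Method", "w:Extrapolation")],
)
for triple in berlin:
    print(triple)
```

The same property name appears three times with different prefixes, once per role.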
 
A user of the data could easily get the ground triple, either by adding an [http://www.w3.org/TR/2009/WD-owl2-primer-20090421/ OWL2 axiom] stating that <code>p:Population</code> is derived from the [http://www.w3.org/TR/2009/WD-owl2-primer-20090421/#Property_Chains property chain] of <code>s:Population</code> and <code>v:Population</code>, or by using a [http://www.w3.org/TR/sparql11-query/#construct SPARQL construct query] with the same effect.
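
The effect of such a property chain can be sketched in a few lines of Python. This is only an illustration of the join on the statement node, not an OWL reasoner or a SPARQL engine, and the helper <code>derive_ground_triples</code> is a hypothetical name.

```python
def derive_ground_triples(triples):
    """Join s:P and v:P triples on the statement node to obtain p:P triples."""
    values = {}
    for s, p, o in triples:
        if p.startswith("v:"):
            # remember the main value(s) of each statement node, per property
            values.setdefault((s, p[2:]), []).append(o)
    ground = []
    for s, p, o in triples:
        if p.startswith("s:"):
            # o is the statement node; look up its value for the same property
            for value in values.get((o, p[2:]), []):
                ground.append((s, "p:" + p[2:], value))
    return ground


triples = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("Berlin:Statement1", "q:Method", "w:Extrapolation"),
]
print(derive_ground_triples(triples))  # [('w:Berlin', 'p:Population', 3499879)]
```

Qualifier triples play no part in the join, so the derived triple carries no qualifier information.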
 
One additional feature to support Semantic Web browsers is to give the statement a label that corresponds to a human-readable form of the value including the qualifiers. The label would be exported in all available languages. This enables Semantic Web browsers to display the first triple in the above serialization in a way that is useful to the viewer of the data. We tested it with [http://marbles.sourceforge.net/ Marbles] and the [http://uriburner.com/ode/ OpenLink Data Explorer].
 
In order to see how ''none'' and ''unknown'' are represented, please refer to the specification below.
 
=== Statements with references ===
Now that we have a node for the statement, adding a reference is straightforward:
 Berlin:Statement1 o:Reference Berlin:Statement1-Reference1 .
We will investigate whether we can reuse an appropriate property from the [http://www.w3.org/TR/prov-o/ provenance ontology] here. We will also expand on how references are modeled as that work progresses.
 
=== Representation of values ===
Note that the values can be, and usually are, further structured themselves. We will detail how the values are exported as we progress with specifying them further within Wikidata. Since values often have a unit or an accuracy associated with them, they will often be represented as yet another intermediate node with the respective values attached to it. For our examples we only deal with the datatype ''entity''.
 
== Specification ==
The specification is defined as a mapping of the production rules for [http://meta.wikimedia.org/wiki/Wikidata/Data_model#Wikidata_Object_Notation Wikidata statements written in the Wikidata Object Notation] to production rules for [http://www.w3.org/TR/owl2-syntax/ OWL2 axioms written in Functional-style syntax]. The [http://www.w3.org/TR/owl2-mapping-to-rdf/ transformation of OWL2 axioms to RDF triples in turn is defined by the OWL2 standard]. The current specification is slightly simplified for now as it omits intervals and ranks.
 
{{User:Markus Krötzsch/BNFtable|
{{User:Markus Krötzsch/BNFdef|ItemDescription|'ItemDescription(' '''Item''' {'''Statement'''} ')'}}
{{User:Markus Krötzsch/BNFdef|Item|'''IRI'''}}
{{User:Markus Krötzsch/BNFdef|Property|'''IRI'''}}
{{User:Markus Krötzsch/BNFdef|Statement|'Statement(' '''Snak''' {'''PropertySnak'''} {'''ReferenceRecord'''} ')'}}
{{User:Markus Krötzsch/BNFdef|Snak|
'''PropertySnak''' {{!}} '''PropertyInstanceOfSnak''' {{!}} '''PropertySubclassOfSnak''' }}
{{User:Markus Krötzsch/BNFdef|PropertySnak|
'''PropertyValueSnak''' {{!}} '''PropertySomeValueSnak''' {{!}} '''PropertyNoValueSnak''' }}
{{User:Markus Krötzsch/BNFdef|PropertyValueSnak|'PropertyValueSnak(' '''Property''' '''Value''' ')'}}
{{User:Markus Krötzsch/BNFdef|PropertyNoValueSnak|'PropertyNoValueSnak(' '''Property''' ')'}}
{{User:Markus Krötzsch/BNFdef|PropertySomeValueSnak|'PropertySomeValueSnak(' '''Property''' ')'}}
}}
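
To make the production rules more tangible, the following Python sketch models a subset of them as data classes and prints a statement in a notation shaped like the grammar. Everything here is an assumption for illustration: the class layout, the whitespace between parts, the property IRIs, and the treatment of values as plain strings (the '''Value''' rule is not defined above). Reference records and the InstanceOf and SubclassOf snaks are left out.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PropertyValueSnak:
    prop: str
    value: str

    def notation(self) -> str:
        return f"PropertyValueSnak({self.prop} {self.value})"


@dataclass
class PropertySomeValueSnak:
    prop: str

    def notation(self) -> str:
        return f"PropertySomeValueSnak({self.prop})"


@dataclass
class PropertyNoValueSnak:
    prop: str

    def notation(self) -> str:
        return f"PropertyNoValueSnak({self.prop})"


PropertySnak = Union[PropertyValueSnak, PropertySomeValueSnak, PropertyNoValueSnak]


@dataclass
class Statement:
    main_snak: PropertySnak                       # the leading Snak
    qualifiers: List[PropertySnak] = field(default_factory=list)

    def notation(self) -> str:
        parts = [self.main_snak.notation()] + [q.notation() for q in self.qualifiers]
        return "Statement(" + " ".join(parts) + ")"


statement = Statement(
    PropertyValueSnak("p:Population", "8000"),
    [PropertyValueSnak("p:As_of", "15th_century")],
)
print(statement.notation())
```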
 
== Discussion of alternatives ==
The following discussion gives the rationale for the design decisions in this note. Readers who are not interested in it can skip it.
 
=== Punning for the property names ===
Instead of having different properties in the <code>p:</code>, <code>s:</code>, <code>v:</code> and <code>q:</code> namespaces, which turns every property in Wikidata into four properties in the RDF export, we could have used a single property in all four roles. In almost all cases it would still be clear which role is meant:
<code>p:</code> connects the item with the value, <code>s:</code> the item with the statement, <code>v:</code> the statement with the value, and <code>q:</code> the statement with a qualifier value. The only ambiguity would be between <code>v:</code> and <code>q:</code>, as both connect the statement with a value: if a qualifier uses the same property as the statement's main value, the two can no longer be told apart.
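
A minimal sketch of this ambiguity, using made-up names (<code>w:Item</code>, <code>P</code>, <code>Item:Statement1</code>) and plain Python tuples in place of RDF triples:

```python
# Punned export: one property name P serves in all four roles.
punned = [
    ("w:Item", "P", "Item:Statement1"),   # item -> statement (role of s:)
    ("Item:Statement1", "P", "value A"),  # main value (role of v:)
    ("Item:Statement1", "P", "value B"),  # qualifier with the same property (role of q:)
]


def candidate_main_values(triples, statement, prop):
    """Return every value attached to the statement node via prop."""
    return [o for (s, p, o) in triples if s == statement and p == prop]


# Two candidates come back, and nothing in the data says which is the main value.
print(candidate_main_values(punned, "Item:Statement1", "P"))
```

With separate <code>v:</code> and <code>q:</code> properties the same lookup would return exactly one value per role.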
 
Punning is generally frowned upon on the Semantic Web, but it is not unheard of; e.g., OWL2 allows individuals and classes to be punned. Also, a [http://www.jenitennison.com/blog/node/170 recent proposal in the notorious httpRange-14 discussion suggests punning] as a solution.
 
We decided not to use punning in our case, but rather to accept the proliferation of properties. This is not only best practice but also necessary for the OWL2 DL serialization to be valid OWL, as we need to distinguish clearly between object properties and datatype properties.
 
We also wanted to ensure that the vocabulary developed within Wikidata is reusable outside of Wikidata. Since we expect most external data publishers to use the direct representation of data with triples (i.e. just the ground triple), we also wanted to ensure that such a property is available for external reuse. Ironically, this property is not the one we use in the Wikidata export itself, but there are well-defined relationships between them, as described above.
 
=== Named graphs ===
Giving a claim a reference can be done in three ways:
* put every claim in one file and then add provenance metadata about that file
* put every claim in a named graph and then add provenance metadata about that graph
* reify every claim so that we can add provenance metadata directly to the claim
 
=== Quads ===
There is no standard for quads, and using them would not save that much.
 
=== Reification ===
No one likes it.
 
=== Publish the ground triple ===
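No. As discussed under [[#Ground triples|Ground triples]], a ground triple omits the qualifiers, so publishing it would lead to undesirable query results.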
 
== To do ==
* Representation of data values
* Representation of references
* Representation of the following Snaks: InstanceOf, SubclassOf, PropertyInterval, and PropertySomeInterval
 
== Namespaces ==
* <code>w:</code> for Wikidata items
* <code>o:</code> for the Wikidata ontology (a fixed and small set of terms)
* <code>p:</code> for Wikidata properties
* <code>q:</code> for properties used as qualifiers
* <code>s:</code> for properties used to connect items and statements
* <code>v:</code> for properties used to connect statements and values
* <code>Berlin:</code> a prefix, used as a shortcut, for nodes that belong to the item <code>w:Berlin</code>, such as its statements (e.g. <code>Berlin:Statement1</code>)
* <code>xsd:</code>, <code>rdf:</code>, <code>rdfs:</code>, and <code>owl:</code> with their usual meanings