Wikipedia:Authority control integration proposal: Difference between revisions

 
(47 intermediate revisions by 6 users not shown)
{{Superseded|[[Wikipedia:Authority control integration proposal/FAQ|the project FAQ]], as '''the project has now concluded'''}}
==Towards deep authority control integration in Wikipedia==
 
{{Notice|If you've '''found an error''' with one of the VIAF codes added, please [[Wikipedia:VIAF/errors|list it here]].}}
==[[Wikipedia:Village_pump_(proposals)#Authority_Control_Integration]]==
{{Seemain| User:Maximiliankleinoclc/Authority control integration}}
I am Max Klein, the [[outreach:Wikipedian_in_Residence#Wikipedians_in_Residence| Wikipedian in Residence]] for the non-profit member cooperative [[OCLC]]. There is a template, {{tl|Authority control}}, which is in use on 4,000 pages to link to [[VIAF]]. I've determined [http://dl.dropbox.com/u/10997393/wikilinks.out ~260,000] more pages where the template would be useful, together with the corresponding VIAF number. This project and template serve the educational goal of gathering more official information about creators. VIAF contains information that is not currently on the encyclopedia but would be useful if a reader wanted to learn more about a creator's publishing history and preferred name forms. Are there any objections before I undertake the effort to write a bot to add this template?
 
==Video Summary of the proposal==
On [http://www.youtube.com/watch?v=uwwTNmJUQ8w youtube].
[[Authority control]] is a library practice of creating official records to disambiguate articles and creating connections between them. So far German Wikipedia has linked author articles to their national library unique identifiers in nearly all appropriate cases. On English Wikipedia an analogous effort is less than one percent complete. The [[VIAF| Virtual International Authority File (VIAF)]] is an international project to merge all national authority files into a single master. The [http://dl.dropbox.com/u/10997393/wikipedia2auth3.py algorithm] which VIAF relies on for matching the national files uses Wikipedia for hints, and in the process links to the Wikipedia article of the author when matches are made. Using a bot we could make [http://dl.dropbox.com/u/10997393/wikilinks.out 0.25 million reciprocal links] from Wikipedia to VIAF, using a template that already exists in 4,000 articles. The debate centres on whether we still think that the {{tl|Authority control}} template is useful and should be scaled up. Long term, this could have strong [[m:Wikidata|Wikidata]] implications, in that links to VIAF could be used to draw in properties about the authors - landmark dates, major works, preferred names, etc. - and propagate authoritative metadata across all language Wikipedias.
 
==Introduction==
This proposed project intends to extend and systematise the use of [[authority control]] identifiers, using the {{tl|Authority control}} template, on English Wikipedia articles. ''Authority control'' is the [[term-of-art]] in librarianship, archival practice and related fields for [[unique identifiers]] to [[Wikipedia:Disambiguation|disambiguate]] objects (people, places, academic subjects, etc.). These fields of study have different conceptualisations of unique identifiers from some other fields because many systems in place are backwards-compatible with pre-computerisation systems. This project aims to connect the English Wikipedia to this [[long tail]] of identifiers.
 
The current proposal focuses on biographies, although this may be extended in future to cover other topics, and is built around the use of data from [[VIAF]], a composite system bringing together several major authority files. VIAF algorithmically matches and clusters entries from the individual authority files, and uses data scraped from Wikipedia to aid the process; as a result, there have already been a large number of Wikipedia-VIAF matched pairs identified and this provides a very effective springboard to work from.
===Introduction===
The utility of [[authority control]] systems is understood by the library community and Wikipedians alike. For the library community, authority files are indispensable tools for precise cataloging. To Wikipedians, the inclusion of authority control is part of the march towards building a better encyclopedia with more structured data. The [[VIAF|Virtual International Authority File]] is a joint project of 20 national libraries, operated by the Online Computer Library Center (OCLC), to combine their disambiguating [[authority file|authority files]] into one source. It algorithmically matches and clusters the agreements between the national authority files, and uses data scraped from Wikipedia to aid the process.
 
The proposal was originally written up here, and [[Wikipedia:Village pump (proposals)/Archive 89#section Authority Control Integration|discussed on the Village Pump]]. It has since been updated to include some of the feedback and commentary received during the discussions. While the Village Pump discussion was broadly favourable, it has been formally listed as an RFC in order to ensure clear support from the community before implementation later in 2012.
This project is not the first effort to merge authority control with Wikipedia, but rather aims to build on previous projects; its main goal is to prepare infrastructure for use in Wikidata and future interoperability with VIAF, and for the linked data opportunities that such a bridge would confer.
 
This plan is being coordinated by [[User:Maximiliankleinoclc|Max Klein]], the Wikipedian in Residence at [[OCLC]], and [[User:Andrew Gray|Andrew Gray]], the Wikipedian in Residence at the British Library. OCLC are the central operating group for VIAF, and have offered to provide technical support for the matching process. If you would like to help work on it, please [[Wikipedia talk:Authority control integration proposal|let us know]].
===What's happening already?===
 
==Background==
German Wikipedia implemented a comprehensive program to match articles with authority identifiers several years ago; a similar project on the English Wikipedia, using {{tl|Authority control}}, has gained some traction but has only covered ~0.4% (4,308 transclusions) of the project's biographies, against a ~50% (219,428 transclusions) coverage rate in German.
There is also a tool-assisted way for human editors to integrate this information, by way of a [[commons:Help:Gadget-VIAFDataImporter| Gadget]].
 
{{main|Wikipedia:Authority control}}
Parallel to this, the [[Wikipedia:Persondata|"Persondata" structured data]] system has been widely rolled out on the English Wikipedia; during 2011, the proportion of biographies with persondata leapt from under 10% to well over 90%. This wealth of structured data means that there is a good opportunity to try and link English Wikipedia articles ''en masse'' in the next few months; if it can be done soon, it will help support the deployment of the cross-project [[Wikidata]] later in 2012-13.
 
[[Authority control]] is a system primarily used in libraries and other metadata services, where a single entity is given a canonical unique identifier. This allows clear disambiguation between different entities with similar names, while also allowing the use of a single identifier for those with multiple variant names. On Wikipedia, this is handled with the {{tl|authority control}} template, which places the identifiers at the end of the article and links out to library catalogues and central authority databases.
 
As well as these reader-visible links, the embedded data helps build infrastructure for future work, such as:
 
*'''Reliable linking from external services''' - we can build lookup services, such as this tool for the German Wikipedia's PND files: http://toolserver.org/~apper/pd/person/pnd-redirect/de/118768581 - which takes you to the article represented by that PND. Such tools allow people to automatically generate links to Wikipedia without guessing at article titles, use the API to pull out leads from articles for reuse in other sites, etc.
There are a number of other templates which use non-standard identifiers for individuals in similar ways - {{tl|OL author}} links to author records on Open Library, {{tl|Gutenberg author}} to index entries on Gutenberg, etc. These may potentially be convertible to a uniform identifier.
*'''Extending the scope for checking metadata''' - we already have methods, such as the [[Wikipedia:Death anomalies project|Death anomalies project]], for comparing the metadata between Wikipedia language editions and spotting inconsistencies. Including identifiers which tie into external services with reliable APIs gives us a lot of additional data for cross-checking.
*'''Returning metadata to the outside world''' - working backwards from this, once we have embedded identifiers, the curators of this metadata will find it a lot easier to incorporate information from Wikipedia, taking advantage of our fairly fast update cycle for things like death dates.
*'''Identifying alternate names''' - particularly for non-standard transliterations, the alternate headings in authority files give us an extensive and curated collection of variants of names. The linkage will help the creation of redirects.
*'''Content creation support''' - the presence of the identifiers allows future work on tools to, e.g., develop scripts to generate author's bibliographies for articles.
 
Currently, around 4,000 articles on the English Wikipedia have some form of embedded authority control identifier, and on Commons, around 45,000 articles contain authority control. On the German Wikipedia, by comparison, [[:de:Wikipedia:Normdaten|around 220,000 articles]] have embedded identifiers.
===How would we get the data?===
 
==The proposal==
As mentioned before, VIAF already uses Wikipedia in its algorithm to help it cluster and match the multitude of national authority files. The VIAF entries themselves took data from 788,582 records created from a wiki dump, using [http://dl.dropbox.com/u/10997393/wikipedia2auth3.py python code] written by OCLC Research Scientists Thom Hickey and Jenny Toves. During the algorithmic creation of the VIAF file, if a Wikipedia link is matched with ~98% accuracy then it is included in the entry. Right now there are 266,202 links from VIAF to Wikipedia. Those links are available [http://dl.dropbox.com/u/10997393/wikilinks.out as a tab-delimited text file].
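A minimal sketch of how such a tab-delimited file could be read into a title-to-VIAF mapping is given here for illustration. The column order (VIAF number first, article title second), the function name <code>load_viaf_links</code> and the sample row are all assumptions; the actual layout of the file is not described on this page.

```python
import csv
import io


def load_viaf_links(handle):
    """Read tab-delimited rows into a {article_title: viaf_id} dict.

    ASSUMPTION: column 0 holds the VIAF number and column 1 the article
    title; the real file may be laid out differently.
    """
    links = {}
    # QUOTE_NONE keeps quotation marks inside titles untouched.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        viaf_id, title = row[0].strip(), row[1].strip()
        if viaf_id.isdigit() and title:
            # Normalise underscores so titles match the displayed form.
            links[title.replace("_", " ")] = viaf_id
    return links


# Made-up sample data, not taken from the real file.
sample = io.StringIO("12345\tExample_Author\n\nnot-a-number\tBroken row\n")
print(load_viaf_links(sample))
```

Rows that do not carry a numeric identifier are silently skipped in this sketch; a real run would probably want to log them for review instead.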
 
This initial proposal focuses on identifiers in biographies; however, it is not intended to be exclusive, and the system can be extended in future to other articles if there is community support for it.
Another potentially useful technique would be to draw on {{tl|normdaten}} and {{tl|authority control}} templates that have GND or LCCN variables but no VIAF variable, since those files are a subset of VIAF. That is, any GND or LCCN corresponds to a VIAF identifier, and conversion can occur between them.
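As a hedged illustration of that conversion, the sketch below only builds the lookup URL for a source identifier and makes no network request. It assumes that VIAF resolves source records at a <code>/viaf/sourceID/</code> path with the codes <code>DNB</code> (for GND) and <code>LC</code> (for LCCN), and that the identifier is already in the form VIAF expects; both points should be checked against the VIAF documentation before use. The function name is hypothetical.

```python
from urllib.parse import quote

# ASSUMPTION: source codes used by VIAF for the two national files.
SOURCE_CODES = {"GND": "DNB", "LCCN": "LC"}


def viaf_lookup_url(scheme, identifier):
    """Build a URL that should redirect to the matching VIAF cluster.

    ASSUMPTION: the /viaf/sourceID/<CODE>|<id> pattern; no normalisation
    of the identifier is attempted here.
    """
    code = SOURCE_CODES[scheme.upper()]
    # The "|" separator must be percent-encoded inside a URL path.
    return "https://viaf.org/viaf/sourceID/" + quote(f"{code}|{identifier}", safe="")


# 118768581 is the PND/GND number used as an example elsewhere on this page.
print(viaf_lookup_url("GND", "118768581"))
```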
 
It is built around use of the [[VIAF|Virtual International Authority File]] (VIAF), an international project to merge multiple national authority files into a single master system. VIAF identifiers correspond to identifiers in other systems, and can be used in parallel with, or instead of, these other identifiers.
====Licensing====
VIAF is licensed under ODC-BY, and OCLC considers use of the canonical URI to be sufficient attribution. For this proposed plan, therefore, there would appear to be no licensing conflicts.
 
The process will involve identifying an appropriate VIAF identifier to match to as many articles as possible, using a number of different methods ranked by probable accuracy. Following this, and testing of the data to ensure it is consistent and accurate, a VIAF identifier will be added to these articles by a bot, using an extended version of the {{tl|Authority control}} template. This tool can later be reused to include other identifiers, such as [[LCCN]], if desired.
===Short-term Integration===
 
===Data sources===
There are potentially two ways to include the identifiers:
 
There are three available sources of data:
#We do it visibly; we roll out or complete the existing {{tl|Authority control}} template by making links reciprocal to the VIAF->Wikipedia links. A bot will be created for the purpose by a combination of interested community developers and, if necessary, OCLC developer resources.
##''Pros:'' Gains mindshare on the importance of the project, and sets a precedent for linking more library sources. Easy use of the identifiers for readers.
##''Cons:'' This will put external links on several hundred thousand pages, which may cause community disputes about which sources to use and whether this is appropriate. The {{tl|Authority control}} template is occasionally challenged as visual clutter and may not be appropriate on some pages.
#We do it invisibly; we either create a new non-displaying template and tracker category, or we leverage an existing one - {{tl|persondata}}, for example, and add an identifier to that.
##''Pros:'' Less controversial and still builds infrastructure for potential Wikidata use. Editors are still able to choose to use {{tl|Authority control}}, but are not forced to do so.
##''Cons:'' Raises no awareness in the short term for the project or the use of VIAF.
 
#'''Articles already using {{tl|Authority control}}'''. Some of these will have VIAF numbers. Where they do not, we can use the LCCN/GND numbers to match a VIAF number and include it in the existing template.
Either of these is workable; it's really a matter of what the community RFC chooses as the best way of doing it. My (Andrew's) personal preference is to do it entirely invisibly (possibly in persondata); this would leave the opportunity for people to add visible linkage templates only when it seems editorially appropriate as a start.
#'''Interwikied articles with identifiers'''. Around 220,000 articles in the German Wikipedia have identifiers. Where an interwiki to the German Wikipedia exists, we can pull the identifier from the linked page, doing some basic metadata checks to ensure the interwiki linkage is accurate.
#:''Around [[:de:Vorlage:NORMDATENCOUNT|145,000 articles]] on the German Wikipedia currently have VIAF identifiers; the rest use other identities, but it may be practical to match them to VIAF.''
#'''VIAF authority file links'''. As part of the matching process, Wikipedia is used as a source of information to help bring VIAF "clusters" together. OCLC have provided an extracted list of over 250,000 English Wikipedia articles with corresponding VIAF numbers, though these may have to be checked to ensure that pages have not been moved since the matching was carried out.
#:''(The matching is done with this [http://dl.dropbox.com/u/10997393/wikipedia2auth3.py python code] written by OCLC Research Scientists Thom Hickey and Jenny Toves.)''
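One way the three sources could be combined, ranked by probable accuracy, is sketched below. The source names, the ranking order and the function <code>merge_candidates</code> are illustrative assumptions rather than an agreed design; the identifiers in the example are made up.

```python
# ASSUMPTION: an illustrative ranking, most trusted source first.
PRIORITY = ["enwiki_template", "dewiki_interwiki", "viaf_links"]


def merge_candidates(sources):
    """Pick one VIAF ID per article, preferring higher-ranked sources.

    `sources` maps a source name to {article_title: viaf_id}.
    Returns (chosen, conflicts): `chosen` maps title -> (viaf_id, source);
    `conflicts` lists lower-ranked candidates that disagree, for human review.
    """
    chosen, conflicts = {}, {}
    for name in PRIORITY:
        for title, viaf_id in sources.get(name, {}).items():
            if title not in chosen:
                chosen[title] = (viaf_id, name)
            elif chosen[title][0] != viaf_id:
                conflicts.setdefault(title, []).append((viaf_id, name))
    return chosen, conflicts


# Made-up titles and identifiers.
example = {
    "enwiki_template": {"A": "1"},
    "dewiki_interwiki": {"A": "2", "B": "3"},
    "viaf_links": {"B": "3", "C": "4"},
}
chosen, conflicts = merge_candidates(example)
print(chosen)
print(conflicts)
```

Keeping the disagreeing candidates rather than discarding them gives a ready-made list for the accuracy sampling.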
 
===Implementation===
===Long-Term Wikidata Integration===
 
The implementation will be done in stages.
Wikidata has the potential to be a “game changer” that will “fundamentally alter the way we think about Wikipedia.” We need to imagine a world where each VIAF entity, bibliographic entity, and Wikidata entity has its own [[URI| Uniform Resource Identifier (URI)]]. Each Wikidata URI for an author or book would link to VIAF, could automatically read live data from the linked data, and could be negotiated upon to deliver item properties. These item properties could serve to dynamically generate infoboxes.
 
# Create lists of page titles and associated VIAF cluster IDs from the enwiki dataset, the dewiki dataset, and the VIAF dataset. These will then be sampled to check for accuracy.
Also, VIAF might be a good set of seed data for Wikidata because it represents multi-lingual linked concepts. Furthermore, once clusters form in Wikidata there may be concepts with GND identifiers but without VIAF identifiers; these could be related to one another, which would help contribute more accurate matching back to VIAF.
# Prior to the bot run, {{tl|Authority control}} will be redeveloped to ensure it scales effectively to the new usage, creating sub-templates for specific identifiers. The documentation for this template, along with [[Wikipedia:Authority control]], will be checked and updated or overhauled where necessary.
# A bot will be developed and tested, then approved through [[WP:BRFA|the standard bot approval process]] to ensure there are no technical problems and that it is compliant with this proposal.
# This bot will add {{tl|authority control}} along with the VIAF codes from this list, once testing is complete.
# Finally, this bot will run periodic reports in conjunction with the VIAF update schedule, to reflect any reshuffling that occurs in the file.
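The periodic report in the last step could, under simple assumptions, be a comparison between the identifiers currently on wiki and a newer VIAF extract. The sketch below assumes both are available as title-to-identifier dictionaries; the function name and report layout are hypothetical.

```python
def reshuffle_report(on_wiki, latest):
    """Compare VIAF IDs stored on wiki with a newer VIAF extract.

    Both arguments map article_title -> viaf_id (an assumed data shape).
    Returns the titles whose cluster ID changed and those no longer listed.
    """
    report = {"changed": {}, "dropped": []}
    for title, old_id in sorted(on_wiki.items()):
        new_id = latest.get(title)
        if new_id is None:
            report["dropped"].append(title)
        elif new_id != old_id:
            report["changed"][title] = (old_id, new_id)
    return report


# Made-up identifiers for illustration only.
print(reshuffle_report({"A": "1", "B": "2", "C": "3"}, {"A": "1", "B": "9"}))
```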
 
===Thoughts?===
 
The crux of this proposal is to reassess the utility and acceptance of the {{tl|Authority control}} template. Do we want to make approximately a quarter of a million edits in conjunction with this template? And if so, does that also imply that a bot to make these edits should be approved once it is proved to be technically sound?
 
{{:User:Maximiliankleinoclc/VIAF graphical timeline}}
 
#Onwiki discussion to develop proposal (by late June)
#RfC on finalised proposal (by mid-July)
#Creation of processes and bots; bot approval (by end of July)
#Deployment of content (through August)
#''Future'': Wikidata integration (part of [[:meta:Wikidata/Technical proposal#Phase 2: Infoboxes|phase 2 of Wikidata]] - entirely dependent on that schedule)
#Documentation (through August)
#Maintenance (...ongoing...)
 
===Template details===
If the community RFC decides not to go ahead with the project, we'll still be able to pass the data generated so far to the Wikidata team, and hopefully it can be used there.
 
The template currently used to handle authority control data is {{tl|Authority control}}; it is placed at the extreme end of the article, just above the categories, and displays a narrow box with the identifiers. These link to an external service. For an example, see [[Fyodor Dostoyevsky#External links|Fyodor Dostoyevsky]] - this uses GND, LCCN, and VIAF codes, and is nested under a navigational template following the external links. It will only be used on "main" articles, and not on subpages or related bibliographies - no two articles should share an identifier.
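The placement rule described above (end of the article, just above the categories) might be implemented along the following lines. This is a sketch only: it assumes the template accepts a <code>VIAF=</code> parameter, uses simple regular expressions rather than a full wikitext parser, and the function name is hypothetical.

```python
import re

# Matches the start of a category link at the beginning of a line.
CATEGORY_RE = re.compile(r"^\[\[\s*Category\s*:", re.IGNORECASE | re.MULTILINE)
# Matches an existing {{Authority control ...}} transclusion.
TEMPLATE_RE = re.compile(r"\{\{\s*Authority[ _]control\b", re.IGNORECASE)


def add_authority_control(wikitext, viaf_id):
    """Insert the template just above the first category, if not present."""
    if TEMPLATE_RE.search(wikitext):
        return wikitext  # never duplicate or overwrite an existing template
    # ASSUMPTION: the parameter is named "VIAF".
    template = "{{Authority control|VIAF=" + viaf_id + "}}\n"
    match = CATEGORY_RE.search(wikitext)
    if match is None:
        # No categories: append at the very end of the page.
        return wikitext.rstrip("\n") + "\n" + template
    return wikitext[:match.start()] + template + wikitext[match.start():]


# Made-up page text and identifier.
page = "Body text.\n\n[[Category:Writers]]\n"
print(add_authority_control(page, "12345"))
```

Running the function twice on the same page leaves it unchanged the second time, which matters for a bot that may revisit articles.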
Any comments, criticisms, etc. gratefully received.
 
As part of this project, we will need to rewrite {{tl|authority control}} to form a wrapper for a number of subsidiary templates, each handling a specific identifier. This will make it easier to maintain as well as easier to develop support for other identifiers, without the need for experimentation on a template used on several hundred thousand pages. Documentation on {{tl|authority control}}, [[Wikipedia:Authority control]], and related pages will be updated accordingly.
 
{{Wikipedia:Authority_control_integration_proposal/FAQ}}
 
- [[User:Maximiliankleinoclc|Max Klein]], OCLC Wikipedian in Residence, and [[User:Andrew Gray|Andrew Gray]], British Library Wikipedian in Residence.
 
==Progress==
 
Now that the [[Wikipedia:Authority_control_integration_proposal/RFC|RFC]] has passed, the work of the bot is underway. Code can be viewed on [https://github.com/notconfusing/VIAFbot GitHub].