Wikipedia:Authority control integration proposal: Difference between revisions
{{Superseded|[[Wikipedia:Authority control integration proposal/FAQ|the project FAQ]], as '''the project has now concluded'''}}
{{Notice|If you've '''found an error''' with one of the VIAF codes added, please [[Wikipedia:VIAF/errors|list it here]].}}
==Video Summary of the proposal==
On [http://www.youtube.com/watch?v=uwwTNmJUQ8w YouTube].
==Introduction==
This proposed project intends to extend and systematise the use of [[authority control]] identifiers, using the {{tl|Authority control}} template, on English Wikipedia articles. ''Authority control'' is the [[term-of-art]] in librarianship, archival practice and related fields for the use of [[unique identifiers]] to [[Wikipedia:Disambiguation|disambiguate]] objects (people, places, academic subjects, etc.). These fields conceptualise unique identifiers differently from some other fields because many of the systems in place are backwards-compatible with pre-computerisation systems. This project aims to connect the English Wikipedia to this [[long tail]] of identifiers.
The current proposal focuses on biographies, although this may be extended in future to cover other topics, and on the use of identifiers from [[VIAF]], a composite system bringing together several major authority files. VIAF algorithmically matches and clusters entries from the individual authority files, and uses data scraped from Wikipedia to aid the process; as a result, a large number of matched Wikipedia-VIAF pairs have already been identified.
The proposal was originally written up here, and discussed on the Village Pump [http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(proposals)&oldid=499166049#Authority_Control_Integration here]. It has since been updated to include some of the feedback and commentary received during the discussions. While the Village Pump discussion was broadly favourable, it will soon be formally listed as an RFC in order to ensure clear support from the community before implementation later in 2012.
This plan is being coordinated by [[User:Maximiliankleinoclc|Max Klein]], the Wikipedian in Residence at [[OCLC]], and [[User:Andrew Gray|Andrew Gray]], the Wikipedian in Residence at the British Library. OCLC are the central operating group for VIAF, and have offered to provide technical support for the matching process. If you would like to help work on it, please [[Wikipedia talk:Authority control integration project|let us know]].
==Background==
{{main|Wikipedia:Authority control}}
[[Authority control]] is a system primarily used in libraries and other metadata services, where a single entity is given a canonical unique identifier. This allows clear disambiguation between different entities with similar names, while also allowing the use of a single identifier for those with multiple variant names. On Wikipedia, this is handled with the {{tl|authority control}} template, which places the identifiers at the end of the article and links out to library catalogues and central authority databases.
As well as these reader-visible links, the embedded data helps build infrastructure for future work, such as:
*'''Reliable linking from external services''' - we can build lookup services, such as this tool for the German Wikipedia's PND files: http://toolserver.org/~apper/pd/person/pnd-redirect/de/118768581 - which takes you to the article represented by that PND. Such tools allow people to automatically generate links to Wikipedia without guessing at article titles, use the API to pull out leads from articles for reuse in other sites, etc.
*'''Extending the scope for checking metadata''' - we already have methods, such as the [[Wikipedia:Death anomalies project|Death anomalies project]], for comparing the metadata between Wikipedia language editions and spotting inconsistencies. Including identifiers which tie into external services, with reliable APIs, gives us a lot of additional data for cross-checking.
*'''Returning metadata to the outside world''' - working backwards from this, once we have embedded identifiers, the curators of this metadata will find it a lot easier to incorporate information from Wikipedia, taking advantage of our fairly fast update cycle for things like death dates.
*'''Identifying alternate names''' - particularly for non-standard transliterations, the alternate headings in authority files give us an extensive and curated collection of variants of names. The linkage will help the creation of redirects.
*'''Content creation support''' - the presence of the identifiers allows future work on tools, e.g. scripts to generate authors' bibliographies for articles.
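As an illustration of the first point in the list above, the following is a minimal sketch of an identifier lookup, assuming we already hold a list of (article title, identifier) pairs extracted from articles. The function names, the sample data and the base URL are hypothetical; this is not the code behind the Toolserver tool mentioned above.

```python
from urllib.parse import quote


def build_identifier_index(articles):
    """Map each identifier to the title of the article that carries it.

    `articles` is an iterable of (title, identifier) pairs (hypothetical input).
    If an identifier appears on several articles, the first title is kept.
    """
    index = {}
    for title, identifier in articles:
        index.setdefault(identifier, title)
    return index


def article_url(index, identifier, base="https://en.wikipedia.org/wiki/"):
    """Return the article URL for an identifier, or None if it is unknown."""
    title = index.get(identifier)
    if title is None:
        return None
    # Wiki URLs use underscores in place of spaces in titles.
    return base + quote(title.replace(" ", "_"))
```

An external service could call `article_url(index, some_identifier)` rather than guessing at article titles; identifiers that are not in the index simply return `None`.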
Currently, around 4,000 articles on the English Wikipedia have some form of embedded authority control identifier, and on Commons, around 45,000 articles contain authority control. On the German Wikipedia, by comparison, [[:de:Wikipedia:Normdaten|around 220,000 articles]] have embedded identifiers.
==The proposal==
This
It is built around use of the [[VIAF|Virtual International Authority File]] (VIAF), an international project to merge multiple national authority files into a single master system. VIAF identifiers correspond to identifiers in other systems, and can be used in parallel with, or instead of, these other identifiers.
It will involve identifying an appropriate VIAF identifier number for as many articles as possible, using a number of different methods ranked by probable accuracy. Following this, and testing of the data to ensure it is consistent and accurate, the identifier will be added to these articles by a bot, using an extended version of the {{tl|Authority control}} template.
===Data sources===
There are
#'''Articles already using {{tl|Authority control}}'''.
#'''Interwikied articles with identifiers'''.
#:''Around [[:de:Vorlage:NORMDATENCOUNT|145,000 articles]] on the German Wikipedia currently have VIAF identifiers; the rest use other identifiers, but it may be practical to match them to VIAF.''
#'''VIAF authority file links'''.
#:''(The matching is done with this [http://dl.dropbox.com/u/10997393/wikipedia2auth3.py Python code] written by OCLC research scientists Thom Hickey and Jenny Toves. During the algorithmic creation of the VIAF file, if a Wikipedia link is matched with ~98% accuracy, it is included in the entry. Right now there are 266,202 links from VIAF to Wikipedia. Those links are available [http://dl.dropbox.com/u/10997393/wikilinks.out as a tab-delimited text file].)''
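The combination of these sources could be sketched roughly as follows. This is an illustration only, not the OCLC matching code linked above. It assumes that each row of the tab-delimited file holds a VIAF identifier followed by an article title (the real column layout is not described here), and that the three sources are trusted in the order listed; the source labels and function names are hypothetical.

```python
# Hypothetical source labels, ordered from most to least trusted.
# The ordering is an assumption based on the list order above.
SOURCE_RANK = ["existing_template", "interwiki", "viaf_link"]


def parse_links(text):
    """Parse tab-delimited rows into (title, viaf_id) pairs.

    Assumes column 1 is the VIAF identifier and column 2 the article title.
    Rows with fewer than two columns are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            pairs.append((fields[1], fields[0]))
    return pairs


def choose_identifiers(candidates):
    """Keep, for each title, the identifier from the best-ranked source.

    `candidates` is an iterable of (title, viaf_id, source) triples.
    """
    best = {}
    for title, viaf_id, source in candidates:
        rank = SOURCE_RANK.index(source)
        if title not in best or rank < best[title][1]:
            best[title] = (viaf_id, rank)
    return {title: viaf_id for title, (viaf_id, _rank) in best.items()}
```

Where two sources disagree about an article, this sketch silently prefers the higher-ranked one; in practice such conflicts would probably be better logged for human review during the data-testing stage.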
===Implementation===
The implementation will be done in
# Create
# Prior to the bot run, {{tl|Authority control}} will be redeveloped to ensure it scales effectively to the new usage, creating sub-templates for specific identifiers. The documentation for this template, along with [[Wikipedia:Authority control]], will be checked and updated or overhauled where necessary.
# A bot will be developed and tested, then approved through [[WP:BRFA|the standard bot approval process]] to ensure there are no technical problems and that it is compliant with this proposal.
#
# Finally, this bot will run periodic reports in conjunction with the VIAF update schedule, to reflect any reshuffling that occurs in the file.
{{:User:Maximiliankleinoclc/VIAF graphical timeline}}
===Template details===
The template currently used to handle authority control data is {{tl|Authority control}}; it is placed at the extreme end of the article, just above the categories, and displays a narrow box with the identifiers. These link to an external service. For an example, see [[
As part of this project, we will need to rewrite {{tl|authority control}} to form a wrapper for a number of subsidiary templates, each handling a specific identifier. This will make it easier to maintain as well as easier to develop support for other identifiers, without the need for experimentation on a template used on several hundred thousand pages. Documentation on {{tl|authority control}}, [[Wikipedia:Authority control]], and related pages will be updated accordingly.
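The placement rule described above (at the end of the article, just above the categories) could be sketched as below. This is not the bot's actual code: the function name is hypothetical, and the <code>VIAF=</code> parameter name is an assumption about how the rewritten template will accept the identifier.

```python
import re

# Matches a line that starts a category link, e.g. [[Category:Examples]].
CATEGORY_LINE = re.compile(r"^\[\[Category:", re.IGNORECASE)
# Detects an existing use of the template, in any capitalisation.
EXISTING_TEMPLATE = re.compile(r"\{\{\s*authority control", re.IGNORECASE)


def add_authority_control(wikitext, viaf_id):
    """Insert the template just above the first category line.

    Articles that already carry the template are returned unchanged.
    If there are no categories, the template is appended at the end.
    The VIAF= parameter name is an assumption for this sketch.
    """
    if EXISTING_TEMPLATE.search(wikitext):
        return wikitext
    template = "{{Authority control|VIAF=%s}}" % viaf_id
    lines = wikitext.split("\n")
    for i, line in enumerate(lines):
        if CATEGORY_LINE.match(line.strip()):
            lines.insert(i, template)
            break
    else:
        lines.append(template)
    return "\n".join(lines)
```

Skipping articles that already use the template keeps the edit idempotent, so a repeated bot run would not add a second box to the same page.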
{{Wikipedia:Authority_control_integration_proposal/FAQ}}
==Progress==
Now that the [[Wikipedia:Authority_control_integration_proposal/RFC|RFC]] has passed, the work of the bot is underway. Code can be viewed at [https://github.com/notconfusing/VIAFbot GitHub].
- [[User:Maximiliankleinoclc|Max Klein]], OCLC Wikipedian in Residence, and [[User:Andrew Gray|Andrew Gray]], British Library Wikipedian in Residence.