Wikipedia Diversity Observatory
A project to understand and increase diversity within Wikipedia content and communities
The Wikipedia Cultural Diversity Observatory (WCDO) is a space to study Wikipedia's diversity coverage, discuss the strategic needs and propose solutions to improve it.

To do so, it aims to raise awareness of Wikipedia’s current state of diversity by providing datasets, visualizations and statistics, as well as by pointing to solutions and tools.
Mission
This project’s vision is to align the movement to achieve cultural diversity in the content of the different projects.
This project's mission is to create a joint space for researchers and activists to study and fight against cultural knowledge gaps and to promote knowledge equity. To that end, we provide strategically valuable data and resources to organize and take action.
This project is especially motivated by the Africa knowledge gap (see this interview).
Goals
These are the three main outcome goals we are working on to increase the cultural diversity within the Wikimedia projects:
Main outcome goals:
- Every Wikipedia language edition ensures a minimal representation of its own territories’ cultural context (from geography to biographies, traditions, language and others).
- Every Wikipedia language edition ensures a minimal coverage of every other language's cultural context content.
- Every Wikipedian has information about marginalized languages without a Wikipedia, so that they can help the speakers of those languages create one and start representing their cultural context.
To reach these goals, we set out more specific goals for the project's community engagement and research and development activities.
Community engagement goals:
- Every Wikipedia language community is aware of the knowledge inequalities in the entire Wikipedia project.
- Every Wikipedia language community is aware of the importance of representing its own culture, so that users of the other language editions can import and learn from it.
- Every Wikipedia event and community-organized contest considers dedicating sections and activities to mitigating the cultural knowledge gaps and the inequalities derived from them.
Research and development goals:
- Every Wikipedian has access to data visualization tools to browse the gaps and create valuable new articles.
- Every Wikipedian has access to statistical analysis of the extent of the gaps and understands the priorities for bridging or covering them.
- Every Wikipedian has access to data on the world's languages without a Wikipedia, in order to spread awareness of their importance and encourage the creation of one.
Framing the problem
“The sum of human knowledge” is not in a single language but in the existing cultural diversity from every territory and language in the world. Wikipedia aims at gathering it.
For each Wikipedia language edition to have content that represents something close to the world's existing cultural diversity, we have to work on very different aspects and align all the Wikimedia movement stakeholders to facilitate the creation of articles that reflect that diversity.
We see this as two sequential processes: representation and sharing.
For each language, the process of representation implies creating content that relates to the editors' geographical and cultural context. The process of sharing, in turn, implies understanding where the gaps are, both in one's own language and in the others, in order to exchange each other's cultural context content and increase the cultural diversity of all languages.
In order to facilitate cultural context representation, we propose:
- Create, collect, process, and present metrics that describe the creation and usage of cultural content on Wikimedia projects.
- Understand the situation of all the world's marginalized languages that could become Wikipedia language editions, and consequently, the potential content about their cultural context they would bring to the entire Wikipedia project.
In order to help each language share (import and export) cultural context content, we propose:
- Ideate and develop tools that prioritize and allow finding the most valuable content (popular and relevant) that might be essential to create across projects.
- Provide training to organizations and individuals in these tools so that they can help mitigate the knowledge gaps and increase the cultural diversity in Wikimedia projects.
Cultural diversity tools
As an observatory, the outcomes of this project bridge the gap between research and activism rather than focusing on content creation itself. This portal provides some results directly, but most of the visualizations are located, or better depicted, at an external website (wcdo.wmflabs.org) built with Plotly and hosted on Toolforge.
Even though some results are repeated on both sites, those at the external website are preferable because they allow better user interaction with the data. For example, the tables in the external version of List of Wikipedias by Cultural Context Content offer a filtering feature that is not available in the version of List of Wikipedias by Cultural Context Content on this portal.
This project is continually developing research questions, concepts, visualizations and tools. DISCLAIMER: The project is currently in beta 1, so if you find a bug, we would be pleased to receive a report by e-mail at tools.wcdo@tools.wmflabs.org.
WCDO's main concepts are Cultural Context Content, Culture Gap and Top CCC articles lists:
Cultural Context Content (CCC) aka Local Content
Cultural Context Content (CCC) (methodology) is the group of articles in a Wikipedia language edition that relate to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.) (Figure 1). You can see this YouTube video explaining its creation and use.
In order to create any CCC it is necessary to establish a language-territories mapping, in other words, to pin down the territories where the language is spoken natively or has official legal status.
Cultural Context Content is collected as a group of datasets (Figure 2), which are released on a monthly basis. These datasets are used to compute and depict several statistics on the state of knowledge equality and cross-cultural coverage.
For example, it is possible to consult the extent of CCC in each Wikipedia language edition (List of Wikipedias by Cultural Context Content) or even the number of articles from a particular territory in one language edition's CCC (List of Language Territories by Cultural Context Content).
Culture Gap
The culture gap occurs when a Wikipedia language edition does not cover articles that belong to another language edition's CCC. Around 50% of the articles missing across language editions (the language gap) are due to the culture gap.
In order to compute the culture gap and other statistics, WCDO proposes calculating the intersections between different sets of articles (e.g. the articles common to all articles from the English language edition and the articles from the Japanese CCC). Using intersections makes it possible to see the absolute number of shared articles and their extent (the relative importance) in each of the two sets.
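The intersection idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WCDO's actual code: it assumes that articles in different language editions are matched by a shared identifier (hypothetical Wikidata item IDs here), and the function name `intersection_stats` and the sample data are invented for the example.

```python
def intersection_stats(set_a, set_b):
    """Return the size of the overlap between two article sets
    and its extent (relative share) in each of the two sets."""
    common = set_a & set_b
    return {
        "common": len(common),
        "extent_in_a": len(common) / len(set_a) if set_a else 0.0,
        "extent_in_b": len(common) / len(set_b) if set_b else 0.0,
    }

# Hypothetical item identifiers, for illustration only.
english_articles = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"}
japanese_ccc = {"Q3", "Q4", "Q9", "Q10"}

stats = intersection_stats(english_articles, japanese_ccc)
print(stats)  # {'common': 2, 'extent_in_a': 0.25, 'extent_in_b': 0.5}
```

In this toy example the two shared articles make up 25% of the English set but 50% of the Japanese CCC, which is why the same intersection is reported relative to both sets.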
In these two tables it is possible to see the culture gap in two different ways: first, the spread of one language's CCC across the rest of the Wikipedia language editions, and second, one language edition's coverage of the CCC of all the other languages.
- Language culture gap (spread) or CCC spread.
- Language culture gap (coverage) or CCC coverage.
Top CCC articles lists
Wikipedia language editions should not be a replica of each other and the gap may never be completely closed. However, a minimal coverage of all other languages should be a goal on the agenda of each Wikipedia edition to create more multicultural (and complete) encyclopaedias.
Top CCC articles lists can help in providing content for this minimal cultural coverage. Inspired by the Vital articles lists, the Top CCC articles lists present the most relevant articles in terms of different metrics (e.g. number of editors or pageviews) and specific content types (e.g. geolocated articles or biographies of women) from a language's cultural context or a country's cultural context.
The currently generated Top CCC articles lists are:
- list of CCC articles with the most editors (Editors),
- list of CCC articles with featured article distinction, most bytes and most references (weights: 0.8, 0.1 and 0.1 respectively) (Featured),
- list of CCC articles with geolocation and with the most links coming from CCC,
- list of CCC articles with keywords in the title and with the most bytes (Bytes),
- list of CCC articles categorized in Wikidata as women and with the most edits (Women),
- list of CCC articles categorized in Wikidata as men and with the most edits (Men),
- list of CCC articles created during the first three years and with the most edits (First 3Y.),
- list of CCC articles created during the last year and with the most edits (Last Y.),
- list of CCC articles with the most pageviews during the last month (Pageviews),
- list of CCC articles with the most edits in talk pages (Discussions).
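To illustrate how a weighted list such as Featured might be ranked, here is a minimal sketch. The weights 0.8, 0.1 and 0.1 come from the text above; everything else is an assumption made for the example, including the min-max normalization of bytes and references, the field names, and the function names. The observatory's real scoring may differ.

```python
def normalize(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def rank_featured(articles, weights=(0.8, 0.1, 0.1)):
    """Rank articles by a weighted sum of featured status, bytes and references.

    Assumption: bytes and references are min-max normalized before weighting.
    """
    w_feat, w_bytes, w_refs = weights
    bytes_n = normalize([a["bytes"] for a in articles])
    refs_n = normalize([a["references"] for a in articles])
    scored = []
    for a, b, r in zip(articles, bytes_n, refs_n):
        score = w_feat * (1.0 if a["featured"] else 0.0) + w_bytes * b + w_refs * r
        scored.append((score, a["title"]))
    # Highest score first.
    return [title for score, title in sorted(scored, reverse=True)]

# Hypothetical articles, for illustration only.
articles = [
    {"title": "A", "featured": False, "bytes": 90000, "references": 120},
    {"title": "B", "featured": True, "bytes": 30000, "references": 40},
    {"title": "C", "featured": True, "bytes": 60000, "references": 80},
]

print(rank_featured(articles))  # ['C', 'B', 'A']
```

With these weights the featured distinction dominates: the long but non-featured article A still ranks below both featured articles.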
On this page, you can consult the list for a particular country or language CCC, generated on a monthly basis from the latest CCC dataset. You need to specify the list parameter (editors, featured, geolocated, keywords, women, men, created_first_three_years, created_last_year, pageviews or discussions), the language target parameter (as lang_target and the language wikicode), the language origin parameter (as lang_origin and the language wikicode), and, optionally, to limit the scope of the selection, the country origin parameter as part of the CCC (as country_origin and the country ISO 3166 code). If no country is selected, the default is 'all'.
One possible URL for the Top CCC list by number of editors, with Spanish as the language origin, Italian as the language target and no country, would be: https://wcdo.wmflabs.org/top_ccc_articles/?list=editors&source_lang=es&target_lang=it
A similar list but limited to a specific country and to women, would be:
The generated table includes several metrics and shows, in the top right column, whether the article is available in the target language: it gives the current title if the article exists, or otherwise a possible title generated by a translator or taken from a Wikidata label.
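A Top CCC list URL like the one above can be assembled programmatically. The following is a sketch, not an official client: the query keys `list`, `source_lang` and `target_lang` are copied from the example URL, while `country_origin` is taken from the parameter description and is not verified against the live tool. The helper name `top_ccc_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://wcdo.wmflabs.org/top_ccc_articles/"

def top_ccc_url(list_name, source_lang, target_lang, country=None):
    """Build a Top CCC articles URL from its query parameters."""
    params = {
        "list": list_name,
        "source_lang": source_lang,
        "target_lang": target_lang,
    }
    # Assumption: the optional country filter uses the key "country_origin".
    if country is not None:
        params["country_origin"] = country
    return BASE_URL + "?" + urlencode(params)

url = top_ccc_url("editors", "es", "it")
print(url)
# https://wcdo.wmflabs.org/top_ccc_articles/?list=editors&source_lang=es&target_lang=it
```

Using `urlencode` rather than string concatenation keeps the URL valid if a parameter value ever contains characters that need escaping.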
Another way to browse the lists is to examine how well one language edition covers the other language editions' Top CCC articles lists (or, centered around countries, the Countries Top CCC article lists), or how well one particular language edition's Top CCC lists are spread across the rest of the language editions.
In this case, it is necessary to specify the language covering or spreading the lists with the lang parameter. This is an example using Catalan Wikipedia:
- Languages Top CCC articles spread from Catalan Wikipedia.
https://wcdo.wmflabs.org/languages_top_ccc_articles_spread/?lang=ca
- Languages Top CCC articles coverage by Catalan Wikipedia.
https://wcdo.wmflabs.org/languages_top_ccc_articles_coverage/?lang=ca
- Countries Top CCC articles coverage by Catalan Wikipedia.
https://wcdo.wmflabs.org/countries_top_ccc_articles_coverage/?lang=ca
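The three example URLs above share the same shape, so they can be generated for any language edition by changing only the lang parameter. A small sketch follows; the page paths are those shown in the examples, and the function name `lang_page_urls` is made up for illustration.

```python
from urllib.parse import urlencode

BASE = "https://wcdo.wmflabs.org"
PAGES = [
    "languages_top_ccc_articles_spread",
    "languages_top_ccc_articles_coverage",
    "countries_top_ccc_articles_coverage",
]

def lang_page_urls(lang):
    """Return the spread and coverage page URLs for one language wikicode."""
    query = urlencode({"lang": lang})
    return [f"{BASE}/{page}/?{query}" for page in PAGES]

urls = lang_page_urls("ca")
for u in urls:
    print(u)
```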
Missing CCC articles
Wikipedia language editions normally tend to cover their own cultural context (from territories to all the cultural expressions) much better than others. However, in around 150 languages the cultural context content is below 10% of the content, which is a sign that it is likely underrepresented. In this case, it is very possible that larger Wikipedia language editions have articles that are missing from their CCC. Sometimes these languages are English, French, Russian and Spanish, which are the languages that usually coexist with other languages with a Wikipedia (only 48 Wikipedia language editions are of languages that do not coexist with other languages in one territory).
In order to improve the representation of local content in these underdeveloped Wikipedias, we proposed the creation of a tool named "Missing CCC articles". It allows us to query articles that should exist in one language's CCC but have not been created yet and exist instead in other languages. Additionally, we can also query articles from a language's CCC that are longer in another language edition.
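Conceptually, both queries reduce to simple set and dictionary operations. The sketch below is an illustration under assumptions rather than the tool's implementation: it assumes articles are matched across editions by a shared identifier and that article length is measured in bytes. All names and sample values are hypothetical.

```python
def missing_ccc_articles(ccc_items, target_edition_items):
    """Items that belong to a language's CCC but have no article
    in that language's own edition yet."""
    return sorted(ccc_items - target_edition_items)

def longer_elsewhere(target_bytes, other_bytes):
    """Items whose article in another edition is longer than the
    article in the target edition."""
    return sorted(
        item for item, size in target_bytes.items()
        if other_bytes.get(item, 0) > size
    )

# Hypothetical identifiers and sizes, for illustration only.
ccc_items = {"Q100", "Q200", "Q300"}   # CCC items found in any edition
target_edition_items = {"Q100"}        # items already in the target edition

missing = missing_ccc_articles(ccc_items, target_edition_items)
longer = longer_elsewhere({"Q100": 1500}, {"Q100": 42000})
print(missing)  # ['Q200', 'Q300']
print(longer)   # ['Q100']
```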
It is possible to query any list by changing the URL parameters or by using the following menus. You first need to select the target language (the one in which you would like to improve local content representation). Additionally, if you want to aim at a specific part of a language's context, you can select the target country and target region; both are optional and allow you to filter for a specific area. For instance, for Target language French, whose language context encompasses several countries, Target country and Target region could be France and Québec.
One possible URL with a query for Luganda CCC about Uganda and Geolocated content that is found in any other language edition would be:
Disclaimer: This tool is still in the alpha phase and may contain some bugs. Your feedback can be useful.
New tools (work in progress)
Currently, we want to use the CCC datasets to monitor the gaps on a continual basis (tracking the creation of articles for specific kinds of content to show whether and where editors are really bridging the gap), along with many other lists, solutions and improvements based on the feedback gathered at past Wikimedia events and from local communities (Figure 4). Likewise, we want to create a multilingual editors dashboard for finding potential collaborators. An editor should be able to query lists or visualizations that show editors from other language editions, or from their own, together with their cultural context interests.
Other diversity tools and research papers
We also want to provide a short overview of other tools and research papers that are useful for understanding and detecting cultural differences between language editions and, possibly, bridging the gaps.
Strategic discussions
This project also aims to encourage debate on the different types of diversity. Some of the Wikimedia 2030 Strategy process discussions in the Diversity Working Group are directed at improving diversity in content and in the current communities.
Activities / Get Involved
The project has been presented at different venues, both as a concept and in its beta phases. It still needs dissemination in order to reach all the Wikimedia events and activities where it could provide value.
This page is the central hub for the research and technical documentation and, at the same time, directs readers to the visualizations.
If you want to collaborate, get involved. If you want to code extra visualizations, you can find the project's code here: github page.