Research:Test External AI Models for Integration into the Wikimedia Ecosystem

Tracked in Phabricator:
Task T369281
Duration: 2024-07 – 2024-12

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


As part of our contributions to WMF's 2024-2025 Annual Plan, Research and collaborators are working on identifying which AI and ML technologies, among the many models already available and continually being released, are ready for WMF to start testing with (at the feature, product, ... levels).

Hypothesis Text


Q1 Hypothesis


If we gather use cases from product and feature engineering managers about the use of AI in Wikimedia services for readers and contributors, we can determine whether we should test and evaluate existing AI models with the goal of integrating them into product features, and, if so, generate a list of candidate models to test.

Q2 Hypothesis


If we test the accuracy and infrastructure constraints of 4 existing AI language models for 2 or more high-priority product use-cases, we will be able to write a report recommending at least one AI model that we can use for further tuning towards strategic product investments.

Methods and Tasks


Define and prioritize existing use cases for AI integration into products


See T370134

Use case definition

  1. Gather documented use cases from Product teams based on past conversations and draft an initial list, organized by task type, intended audience and impact.
  2. Conduct a set of interviews with 7 product leaders to gather their perspectives on AI product needs. These interviews revealed three additional use cases: OCR, image vandalism detection, and talk page translation. Most of the previously identified cases were confirmed and refined based on their feedback, and we also gained early insights into high- and low-priority use cases.
  3. Survey Product Managers for additional input. We asked 13 product managers to review the current list of use cases and identify top-priority, low-priority, and any missing cases. The ranked use cases largely aligned with the initial feedback from leaders. The top priorities included edit-check-related tasks, such as automatically assigning categories to articles (useful beyond edit checks) and identifying policy violations. Structured tasks and mobile-friendly features, like automatic article outlines and worklist generation, were also highly ranked, along with automated image tagging and descriptions.

Results after this stage are here

Use case prioritization


After reviewing responses from the above process, and after the model selection phase is completed, we rank and select use cases based on the following criteria:

  1. Priority signaled: Was this use case mentioned during conversations with Product Leadership or in the PM survey as a top-priority use case?
  2. AI Strategy Alignment: Is the use case aligned with WMF’s new AI Strategy?
  3. Model availability: Have we identified existing models developed externally during the model selection phase that can be applied to this use case, based on the criteria of effectiveness, multilingualism, infrastructure, and openness?
  4. Data availability: Do we have enough labeled data to test? If not, what will it take to compile the necessary data, for example through crowdsourcing or manual evaluation?
  5. Measurability: Can we, in practice, estimate the effectiveness of existing models on the proposed use cases based on quantitative indicators?

Define a set of criteria to identify existing models to test, and select candidate models for use-cases


We review the literature on existing AI models to find good matches for each use-case defined above, based on specific criteria.

Criteria for selecting models to be tested

  1. Effectiveness: Does the model have the potential to perform well for the specific use case based on previous research or similar tasks? Has the model been applied successfully to similar tasks or domains, demonstrating its effectiveness?
  2. Multilingualism: Does the model support the languages required by the use case? In general, is the model designed to handle multiple languages, and has it been intentionally trained and tested on languages other than English?
  3. Infrastructure: Can the model be hosted within our current infrastructure (e.g., LiftWing)? Is the model adaptable to our systems, and have similar models been successfully hosted before? Is there a contingency plan for hosting the model externally if necessary?
  4. Openness: Is the model open-source or available through accessible platforms such as Hugging Face? Does the model's licensing allow for testing, modification, and use in production if needed? Is there sufficient public documentation regarding the model’s architecture, training data, and other relevant aspects?

Define a protocol for external model evaluation


Test models on WMF infrastructure


Timeline


[Q1 24-25] Tasks 1 and 2
[Q2 24-25] Tasks 3 and 4

Results


TODO: Add initial results for each task when ready

Provisional List of Defined Product-AI Use-Cases


Macro-Category | Use Case | Audience | What could this help our movement learn/achieve? | Impact I | Impact II
Structured/Edit Tasks | Detect grammar / typos / misspellings: detect errors in text and propose ways to correct them | Contributors | | Support Newcomers | Automate Patrolling
Structured/Edit Tasks | Detect valid categories for Wikipedia articles: given an article, recommend the top X categories that the article could be tagged with | Contributors | | Support Newcomers | Address Knowledge Gaps
Structured/Edit Tasks | Detect policy violations (e.g., WP:NPOV, WP:NOR) | Contributors | Moderators can see "non-neutral language" in an article highlighted automatically; edit checks for newcomers; editors could accept suggested corrections to edits based on policies and norms | Improve Content Integrity | Support Newcomers
T&S/Moderator Tools | Talk page tone detection: detect negative sentiments and harassment in talk page conversations | Readers and Contributors | Functionaries can see talk page tone and manner issues in contributor stats; editors receive constructive feedback on their tone/manner in talk page discussions; a reader can see if an article has an unusual debate profile on its talk pages | Automate Workflows | Improve Content Integrity
Structured/Edit Tasks | Source verification: verify that the text in an article is supported by the source specified in its inline citation | Contributors | Use LLMs to find new or better sources for claims on Wikipedia | Improve Content Integrity | Automate Workflows
T&S/Moderator Tools | Talk page summaries: generate summaries of talk pages highlighting the main points of discussion and the final consensus | Contributors | New editors can generate a summary of a talk page dialog before joining the discussion; moderators can generate a summary of a discussion | Automate Patrolling | Support Newcomers
Structured/Edit Tasks | Automatic article outlines: generate a structure of sections and subsections for a new article | Contributors | Editors can automatically generate outlines for articles they want to write | Automate Workflows |
Reader Tools | Article summaries: summarize the content of an article in a few sentences | Readers | Readers can browse summaries of articles related to the article they're on; the platform provides an article summary API for first- or third-party use | Improve Content Discovery | Retain New Readers
Structured/Edit Tasks | Wikipedia text generation from sources: generate sentences or paragraphs for a Wikipedia article based on existing reliable sources | Contributors | Achieve new content; inspire new and existing editors who prefer draft suggestions instead of editing from scratch; use GenAI to generate suggestions for Wikipedia articles based on given source content | Automate Workflows | Support Newcomers
Reader Tools | Text to speech: audio format for the encyclopedic content | Readers | Readers can access articles or content in audio format; this could be just pronunciation or full article audio | Accessibility |
Structured/Edit Tasks | Automated image metadata tagging: tag images on Wikipedia and Commons with relevant Wikidata items | Readers and Contributors | Commons users can search using intuitive keywords and find images that have been tagged in arcane ways; editors can browse and easily add images related to the topic they're editing; plus semi-automated image description generation for structured tasks | Improve Search | Automate Workflows
Reader Tools | Automated Q/A generation from Wikipedia articles: generate questions and answers that can help navigate the content of a Wikipedia article | Readers | Use AI to autogenerate quizzes on articles; readers can see all the questions the article they're reading has answers to | Improve Content Discovery | Retain New Readers
WikiSource | Optical Character Recognition system: digitizing documents requires an OCR system that works for all the languages we support | Contributors | Knowledge processing tools like OCR help volunteers digitize documents for projects like Wikisource; such tools are not easy to find for low-resource languages, and providing the right tools helps volunteers contribute more and save time | Automate Workflows | Address Knowledge Gaps
T&S/Moderator Tools | Image vandalism detection: detect images that are maliciously added to articles | Contributors | Patrollers can visualize images that appear to be out-of-context or misplaced in Wikipedia articles | Automate Workflows | Improve Content Integrity
T&S/Moderator Tools | Automated worklist generation: generate lists of articles that are relevant to an editor and that need improvement | Contributors | Editors have an automatically generated list of articles they've contributed to that need additional work | Automate Workflows |
T&S/Moderator Tools | Edit summaries: given an edit, generate a meaningful summary of what happened in the edit (and why) | Contributors | Editors can automatically get an edit summary generated from their edits; patrollers can see a summary of a user's recent edits | Automate Workflows | Automate Patrolling
Reader Tools | Automated reading list generation: recommend relevant articles to read based on current reader interest | Readers | A reader can access a list of "the next 5 things you might want to read, based on what you've already read this session" | Improve Content Discovery | Retain New Readers
Structured/Edit Tasks | Suggest templates for a given editor: retrieve relevant templates for a Wikipedia article | Contributors | Help us learn whether we can effectively and reliably suggest templates for users who want to insert a template on a page; measured as: when a user views "suggested templates" they insert a suggested template 20% of the time | Support Newcomers | Automate Workflows
Structured/Edit Tasks | Policy discovery during editing: retrieve policies that are relevant to the current edit activities | Contributors | Editors can easily find documentation on policies and norms; relevant policies are automatically surfaced to editors in the edit workflow; new editors can ask questions to get help with editing | Support Newcomers | Automate Workflows
Wishlist | Talk page translations: see translations of messages that are in a different language | Readers and Contributors | Discussion venues support multilingualism; can be helpful for ambassadors posting messages in various wikis, Meta-Wiki discussions, Community Wishlist discussions, strategy discussions, and so on | Community inclusivity | Multilingual discussions and communication
Reader Tools | Improve natural language search: improve search so that people can ask questions in natural language | Readers | Readers can ask questions in natural language to retrieve answers and articles; can pull out factoids from deep in articles; can navigate directly to relevant anchor links in articles | Improve Search |
Structured/Edit Tasks | De-orphaning articles (suggest articles related to orphans) | Contributors | | Improve Content Discovery | Support Newcomers

Selected Use Cases


Task 1: NPOV violation detection


A major area we identified as a pool of tasks for testing is policy violation detection, in support of product features such as Edit Checks. More specifically, we would like to see the extent to which existing LLMs are able to detect violations of the NPOV policy. While recent research [1] has shown that existing LLMs are not necessarily accurate at detecting these kinds of policy violations, we aim to extend those tests to more languages and to compare the LLMs with simpler baselines.

Task 2: Peacock behavior detection


This is a special case of Task 1 that should be more constrained and feasible, given that detecting peacock behavior is strictly a language understanding problem. However, recent experiments showed that small LLMs are not sufficiently well trained to detect this type of behavior. We aim here to expand these experiments to more and larger models, and to more languages.

Selected Models


We test how different AI models perform on the selected use-cases. We chose language models that simulate two AI setup scenarios and that meet three of the four model selection criteria: multilingual, open-source models that have previously been shown to be effective on similar tasks.

Scenario | Family | Parameters | Interface | Examples
Fine-tuned models | Small Language Models (SLM) | ~180M | Code | mBERT
Generic large model | LLM - Compact | 8B | Prompt/Code | Llama 3.1 8B
Generic large model | LLM - Large | 70B | Prompt/Code | Llama 3.1 70B

Fine-tuned models


Given their proven accuracy on other editing tasks, we use existing multilingual transformers of limited size (~180M parameters), like mBERT and XLM-RoBERTa Longformer. These models became popular in 2019 as the first generation of large language models able to model meaningful sequences of words. We will call them "small language models" or "SLMs". Unlike modern LLMs, these models can only be accessed via code and need to be finetuned (i.e., go through a light-weight retraining) for each new use-case. For NPOV detection, we finetune two versions of mBERT: one on the top 10 languages and one on English only. We also finetune XLM-RoBERTa on 10 languages. For peacock behavior detection, we finetune the same model versions on 7 languages.
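For concreteness, the following is a minimal sketch of what this kind of fine-tuning could look like using the Hugging Face transformers library. The toy dataset, label names, and hyperparameters are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Illustrative sketch: fine-tuning a multilingual SLM (e.g., mBERT) as a binary
# policy-violation classifier. Data and hyperparameters are made-up examples.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT checkpoint on Hugging Face

# Hypothetical training pairs: text snippets labeled 1 (violation) or 0 (clean).
train_data = Dataset.from_dict({
    "text": ["He is a world-renowned, legendary visionary.",
             "He is a software engineer."],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-policy-violation",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
    tokenizer=tokenizer,
)
trainer.train()

# At inference time, a softmax over the two logits yields the confidence score
# mentioned above, which can later be thresholded.
```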

Generic large models


Based on the model selection criteria and the availability of models for large-scale experimentation, we choose to test two categories of modern language models, which we call LLMs. These models can be used off-the-shelf by programmatically prompting them to classify and generate text for a variety of use-cases.

Compact LLMs. These are relatively small (at most 8B parameters) large language models. They have proven useful for tasks such as AI-assisted coding, simple text classification, and text generation, and our ML team has been able to easily load them on LiftWing. For both NPOV and peacock behavior detection, we prompt Llama 3.1 8B and Mistral 0.3 7B. We also run some additional tests using the AYA 23 model, which has been shown to be useful for multilingual text summarization.

Larger LLMs. These are larger (70B-parameter) language models, which have proven more effective than the compact ones but require more infrastructure for prompting and testing. Given the availability of models in our testing environment, we focus on versions of Llama 70B with different levels of weight quantization (this allows us to test the extent to which quantization affects performance): Llama 3.1 70B-Instruct, Llama 3.0 70B-Instruct, and its quantized version Llama 3 70B Instruct-Lite.
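As a rough illustration of how such models can be prompted programmatically for classification, here is a minimal sketch against an OpenAI-compatible inference endpoint (as offered by platforms such as together.ai). The endpoint URL, model identifier, and prompt wording are assumptions for illustration, not the exact setup used in our tests.

```python
# Illustrative sketch: zero-shot NPOV-violation classification by prompting an
# instruction-tuned LLM through an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

PROMPT = (
    "You are a Wikipedia policy assistant. Does the following sentence violate "
    "the Neutral Point of View (NPOV) policy? Answer with exactly one word: "
    "YES or NO.\n\nSentence: {text}"
)

def classify(text: str, model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo") -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,   # deterministic output for classification
        max_tokens=3,
    )
    answer = response.choices[0].message.content.strip().upper()
    return 1 if answer.startswith("YES") else 0  # 1 = violation detected

print(classify("The article subject is unquestionably the greatest artist ever."))
```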

Evaluation Protocol


We evaluate the selected models on the 2 policy violation use-cases using the following criteria:

  • Precision: How precise are the models? Namely, the likelihood that a policy violation detected by the model is an actual violation. This has to do with how the model has been trained and how mature the technology is.
  • Generalizability and bias: How well do models generalize across languages and topics? That is, the model's precision when detecting violations in multiple languages and across topics. This depends on the nature of the training data and the abstraction capabilities of the model.
  • Performance: How fast are the models? This criterion quantifies the model's speed and latency for policy violation detection on our existing infrastructure. This has to do with the model size in terms of parameters, the processing needed, and the hardware available.
Experimental Setup

Measuring accuracy and generalizability: Data


We use an extensive evaluation dataset to test the accuracy of the chosen models on policy violation tasks.

Method. We gather lists of templates that flag POV violations and peacock behavior. We collect pairs of positive/negative samples for each template, where a positive sample corresponds to the revision that adds a template and a negative sample to the first revision where the template is removed.
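A simplified sketch of how such a pair could be located for a single page via the public MediaWiki Action API is shown below; the actual pipeline likely relies on Wikimedia's internal data lake and dumps, and the template string, page title, and revision limit here are illustrative assumptions. Pagination and edge cases are omitted.

```python
# Illustrative sketch: find the revision that adds a maintenance template (e.g.
# {{POV}}) and the first later revision that removes it, for one page.
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_pair(title: str, template: str = "{{POV", limit: int = 50):
    # rvlimit is capped at 50 when revision content is requested.
    params = {
        "action": "query", "format": "json", "formatversion": 2,
        "prop": "revisions", "titles": title,
        "rvprop": "ids|timestamp|content", "rvslots": "main",
        "rvlimit": limit, "rvdir": "newer",   # oldest-first
    }
    revisions = requests.get(API, params=params).json()["query"]["pages"][0]["revisions"]

    positive = negative = None
    previously_tagged = False
    for rev in revisions:
        tagged = template.lower() in rev["slots"]["main"]["content"].lower()
        if tagged and not previously_tagged and positive is None:
            positive = rev["revid"]           # revision that adds the template
        if positive is not None and previously_tagged and not tagged:
            negative = rev["revid"]           # first revision that removes it
            break
        previously_tagged = tagged
    return positive, negative

print(revision_pair("Example"))
```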

Languages. We collect this data for all languages covered by the AYA 23 model, then exclude all languages with fewer than 1K pairs. After this step, 17 languages remain for NPOV and 10 for peacock detection.

Evaluation data. We sample 1k pairs for each remaining language as evaluation data for both SLMs and LLMs. A total of 34k (NPOV) and 20k (Peacock) evaluation samples are collected. For the English language, we stratify by article topic and recency.

Fine-tuning data. To finetune SLMs, we generate a training set from languages that still have at least 1K additional pairs remaining after excluding the pairs used for the evaluation dataset.


Measuring accuracy and generalizability: Finetuning and Evaluating models


Method - fine-tuned models. We fine-tune the chosen SLMs on the fine-tuning data, so that they can associate each input sample with a positive/negative label and provide a confidence score. We then evaluate the resulting models on the evaluation data for both tasks.

Method - generic models. We prompt compact and large LLMs to obtain a classification label on each input sample from our evaluation data. We convert the label into a positive/negative score and compute evaluation metrics.

Metrics. For the purposes of this high-level report, we will use precision as our main metric. This measures the percentage of detected violations that were actually policy violations.
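For concreteness, a minimal example of how precision is computed on labeled evaluation samples (the labels and predictions below are made up):

```python
# Precision as used here: of all samples the model flags as violations, the
# fraction that are actually labeled as violations in the evaluation data.
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0]   # ground-truth labels from the evaluation pairs
y_pred = [1, 1, 1, 0, 0, 0]   # model predictions (1 = violation detected)

print(precision_score(y_true, y_pred))  # 2 correct out of 3 flagged -> 0.667
```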

Environment. We fine-tune and evaluate SLMs on our analytics infrastructure. To test the accuracy of LLMs at our scale, we resort to an external inference platform, together.ai.


Measuring performance: Infrastructure stress tests

Method. For benchmarking LLM performance, we run experiments that simulate inference requests for the tasks mentioned above. Apart from the choice of model, experiments are configured using the following parameters:

  • Sequence length: The number of input tokens fed to the model for a single query.
  • Batch size: The number of input sequences (e.g., prompts for generation tasks) to process simultaneously. This acts as a rough proxy for concurrent requests from multiple users.
  • Task: The task to run inference for. Currently supported kinds are text classification and text generation. For generation tasks, an additional parameter is max new tokens, i.e., the number of tokens to produce in response to a prompt.

A lightweight harness takes this experiment configuration and initializes the specified model (e.g., Llama 3.1 8B or Mistral 7B) for the given task. It also sets up RAM and VRAM usage tracking for the experiment. Then batch size × sequence length random tokens are generated as inputs for inference, and the attention mask for these tokens is all 1s, since all sequences in a batch are the same length and there are no padding tokens. These inputs are then passed to the model, and performance metrics are written to a file once inference is done.
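The following is a minimal sketch of the kind of harness described above, using PyTorch and transformers; the model name, parameter values, and metric reporting are examples rather than the actual benchmarking code.

```python
# Illustrative sketch: random token IDs of shape (batch_size, sequence_length),
# an all-ones attention mask, and a timed generation pass with VRAM tracking.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"   # example compact LLM
BATCH_SIZE, SEQ_LEN, MAX_NEW_TOKENS = 4, 512, 32

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# batch_size x sequence_length random tokens; all-ones mask (no padding).
input_ids = torch.randint(0, tokenizer.vocab_size, (BATCH_SIZE, SEQ_LEN),
                          device=model.device)
attention_mask = torch.ones_like(input_ids)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
with torch.no_grad():
    model.generate(input_ids=input_ids, attention_mask=attention_mask,
                   max_new_tokens=MAX_NEW_TOKENS, do_sample=False)
latency = time.perf_counter() - start

print(f"latency: {latency:.2f}s, "
      f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```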

Metrics. For the purposes of this high-level report, we will use latency as our main performance metric. This measures the time elapsed (in seconds) between the policy detection request and the reception and processing of the full response.

Evaluation Results and Findings


We ran extensive accuracy, generalizability and performance experiments following the methodology above. We report the main findings in this section.

  • Some policy detection tasks are hard for humans and machines alike. Specifically for NPOV, precision across the three model families is very low. Models overall do better when detecting peacock behavior. This is because, while understanding neutrality requires in-depth reasoning, peacock behavior can be detected via language features.
  • Compact LLMs do not work well on policy detection tasks, with model accuracy close to random for both tasks. Larger LLMs handle policy detection better; however, they are more prone to errors when classifying articles they have not seen before (i.e., articles outside the training data).
  • Finetuned SLMs perform on par with large LLMs for easier tasks, like peacock behavior detection, with precision standing at 0.66 for both classes of models. One main difference between these two classes of models is that finetuned models output a confidence score. This score can be thresholded so that only instances with high confidence are classified as "violations", improving the overall precision of the model at the expense of recall (see the sketch after this list). This flexibility is not available with modern large language models.
  • Generalizability across languages is limited. All families of models work better in English and other European languages written in Latin script, but their performance drops significantly for Asian languages and Arabic.
  • Finetuned model responses are practically real-time, while larger LLMs require almost half a minute. In the best-case scenario, we have to wait around 30 seconds to get a response from one of the larger language models we tested. This time scales almost linearly with the number of concurrent requests to the same model. Conversely, smaller finetuned models are scalable across concurrent requests and require less than 1/100 of a second to provide responses.
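The confidence-thresholding mentioned above can be illustrated with a small sketch; the scores and labels are made up solely to show the precision/recall trade-off, and do not reflect our experimental results.

```python
# Illustrative sketch: treating a sample as a "violation" only when the
# fine-tuned model's confidence exceeds a threshold trades recall for precision.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 0]                              # made-up labels
scores = [0.95, 0.80, 0.55, 0.60, 0.30, 0.91, 0.85, 0.20]      # model confidence

for threshold in (0.5, 0.9):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")
# A higher threshold raises precision (1.00 vs 0.67) but lowers recall (0.50 vs 1.00).
```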

Resources


References

  1. Ashkinaze, Joshua, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak, and Eric Gilbert. "Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms." arXiv preprint arXiv:2407.04183 (2024).