Template A/B testing/Results
This is a final report and summary of findings from more than six months of A/B testing user talk templates on Wikimedia projects. Conclusions are bolded below for easier skimming, but you're highly encouraged to compare the different versions of the templates and read our more detailed analysis.
Background
Why user talk templates matter
As Wikipedia projects have evolved over the past 11 years, templated messages have become an increasingly common tool for communicating with editors, especially those who are new or anonymous. Prior to approximately 2006, most user talk page messages left for new editors were ordinary human conversation, customized for each person and their contribution history. In the face of enormous growth in the number of editors and in the volume of communication needed between editors and on articles, the community developed several template-based messaging systems to simplify and speed up communication about common issues.
The Wikimedia Summer of Research in 2011 provided new insight into the changing nature of communication on Wikipedia (Summary of Findings). Today, for example, close to 80% of first messages to new English Wikipedians are sent using a bot or semi-automated editing tool (see chart). This increase in automated template messages to new users strongly correlates with the decline in retention of new editors to Wikipedia.
Our hypothesis
We hypothesized that by changing the style, tone, and content of a user talk template message, we would be able to improve communication to new or anonymous editors, and thereby encourage positive contributions and/or discourage unwanted contributions.
How template A/B testing works
Two summer researchers, Aaron Halfaker and R. Stuart Geiger, designed the original experiment to test this hypothesis. The method used is roughly as follows:
- A switch parser function pseudo-randomly selects one of the test or control templates based on the current time.
- The chosen template is substituted onto the user talk page and appears to the recipient as a normal message.
- A blank tracking template transcluded in each test or control message allows us to record which users received which version.
Please note: These were live observational studies, not randomized controlled experiments.
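To illustrate the assignment mechanism, here is a minimal sketch in Python (not the actual wikitext parser-function code, and with hypothetical variant names) of the time-based bucketing logic described above:

```python
from datetime import datetime, timezone

# Hypothetical variant names; the live tests used existing user warning templates.
VARIANTS = ["uw-vandalism1-control", "uw-vandalism1-test"]

def pick_variant(variants=VARIANTS):
    """Pick a test or control template based on the current time,
    mimicking the time-based #switch used on the wiki."""
    now = datetime.now(timezone.utc)
    # Bucketing on the current second spreads incoming warnings
    # roughly evenly across the variants over time.
    return variants[now.second % len(variants)]

print(pick_variant())
```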
How we analyzed the data
We used mixed methods of analysis for these tests. Our base metric was whether new versions of templates led to more article edits by a contributor. Increasing the quantity of edits was not our only or primary goal, however, and most of the tests also required a closer look at particular kinds of editing activity, based on the context in which an editor received a specific template message. For example, if a user was being warned for vandalism, increased editing activity (which could be more vandalism) would not necessarily be a positive outcome.
In our early tests, we also spent a considerable amount of time categorizing the type and quality of contributions by hand. Later, we relied more on quantitative methods to isolate changes in the activity of good-faith editors. For example, controlling for the relative experience of an editor before they were warned greatly increased our statistical confidence in the results.
Important note: for all tests where we did not do any hand-checking for quality, we did not include editors who went on to be blocked for any period of time – any increase or decrease in editing activity was among editors who were not vandalism-only accounts. We also always focused on editors for whom this was their first warning.
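As a concrete illustration of these exclusions, here is a hedged Python sketch (the record fields and follow-up window are assumptions for illustration, not our actual analysis code) of how the base metric could be computed while dropping blocked editors and anyone past their first warning:

```python
from dataclasses import dataclass

@dataclass
class WarnedEditor:
    user: str
    group: str                 # "test" or "control"
    first_warning: bool        # was this the editor's first warning?
    blocked_afterward: bool    # blocked for any period of time after the warning
    article_edits_after: int   # article edits in the follow-up window (e.g. 0-3 days)

def article_editing_rate(editors, group):
    """Share of eligible editors in a group who edited articles after being warned."""
    eligible = [e for e in editors
                if e.group == group and e.first_warning and not e.blocked_afterward]
    if not eligible:
        return None
    return sum(1 for e in eligible if e.article_edits_after > 0) / len(eligible)
```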
Warnings about vandalism
Huggle is a powerful vandal-fighting tool used by many editors on English, German, Portuguese, and other Wikipedias. The speed and ease of using Huggle means that reverts and warnings can be issued in very high numbers, so it is not surprising that Huggle warnings account for a significant proportion of first messages to new editors.
Together with the original Huggle test run during the Summer of Research, we ran a total of six Huggle tests on two Wikipedias: English and Portuguese. For further reading on our data analysis, see our journal notes for Huggle.
English
First test of general vandalism warnings via Huggle
This experiment tested many different variables in the standard level one vandalism warning: image versus no image; teaching, personalization, or a mix of both. The results mainly showed us that it is better to test a small number of variables at a time, but there were some indications that increased personalization led to a better outcome. More detailed analysis and results can be found here.
Second test of general vandalism warnings via Huggle
Iterating on the results from the first Huggle test, we tested the default level one vandalism warning against the personalized warning written by Aaron and Stuart, as well as our own even more personalized version, which also removed some of the "directives"-style language and links to general policy pages. Our results showed with confidence that the personalized, no-directives warning did best at retaining good-faith users (those not subsequently blocked) in the short term. However, as time passed (and, presumably, those users received more standard, non-personalized template messages on their talk pages for other editing activities), retention rates for the control and test groups converged.
Third test of general vandalism warnings via Huggle
In our last test of vandalism warnings, we tested our best performing template message from the Huggle 2 test against the control and a very short test message written by a volunteer. Results showed that both test templates did better at retaining good-faith users in the short term in the article space, though the personal/no-directives version was a little more successful than the short message at getting users to edit outside of the article space (e.g., user talk).
Read about an editor from this test.
Conclusions
As we hypothesized, changing the tone and language of the generic vandalism warning, especially by:
- increasing the personalization (active voice rather than passive, explicitly stating that the sender of the warning is also a volunteer editor, including an explicit invitation to contact them with questions);
- decreasing the number of directives and links (e.g., "use the sandbox," "provide an edit summary");
- and decreasing the length of the message;
...led to more users editing articles in the short term (0 to 3 days after receiving the warning).
Portuguese
Working with community members from Portuguese Wikipedia, we designed a test similar to our second Huggle experiment, testing a personalized version, a personalized version with no directives, and the control (default) level one vandalism warning. The small number of registered editors who were warned during the course of the test (half as many as in the English Huggle 2 test) meant that the sample size was too small to detect effects of the templates. Unregistered editors who were warned were very unlikely to edit at all afterward, also making it difficult to see any effect from the templates.
Warnings about specific issues, such as blanking, lack of sources, etc.
For further reading on our data analysis, see our journal notes for Huggle.
Read about an editor from this test.
First test of warnings about specific issues via Huggle
Based on our findings from previous experiments on the level one vandalism warning in Huggle, we decided to use our winning message strategy (a personalized message with no directives) to create new versions of all the other level one user warnings used by Huggle. These issue-specific warnings cover test edits, spam, non-neutral point of view, unsourced content, and attack edits, as well as unsourced edits to biographies of living people, purposefully inserting errors, and blanking articles or deleting large portions of text without an explanation.
Conclusions
When we compared the outcome of the control group to the group that received personalized/no directives warnings, the templates had different effects on registered and unregistered users: for registered users who had made at least 5 edits before being warned, the test templates did better at getting people to edit articles. For unregistered editors, the effect was reversed, and the default version was more successful.
Our explanation for this is that the style of the test templates (extremely friendly, emphasizing the community aspect of encyclopedia building, and encouraging editors to edit again to fix their mistakes) was only effective for registered editors who had already shown a minimum of commitment. Our confidence in the result for registered editors also increased when looking at edits in all namespaces, which supports the theory that the invitation to the warning editor's talk page and the other links are helpful.
One of the other interesting results of this test is that some of these issue-specific warnings (such as the ones for attack and non-neutral edits) are never or rarely used, while some (test, spam, unsourced edits) are used very often. This suggests that the most common mistakes made by new or anonymous editors are not particularly malicious, and should be treated with good faith for the most part.
- The test templates in this experiment were clearly more effective than the default for registered editors who'd made at least 5 edits before being warned.
- Test versions of the templates in this test were not as effective as the current version for unregistered editors. For less experienced editors, the results were muddled.
Second test of warnings about specific issues via Huggle
In our next test of issue-specific templates, we ran the control against a very short version of each message. Again, registered editors were retained more by the test version of each warning, and unregistered editors were retained more by the controls. Considering that the short version is still friendlier than the control (starting with "Hello and welcome!" vs. "Welcome to Wikipedia."), this generally supports our hypothesis. When we looked at edits in all namespaces (not just the article space), the test versions did better at retaining both registered and anonymous editors. It is important to note that confidence for all namespaces was higher in the previous "no directives" test among registered editors, so the link to the user talk page and similar links are likely to be helpful, and these were not present in the short test.
- Both test templates, the longer and shorter versions, were far superior messages when directed at registered editors.
- The test templates were not more effective for unregistered editors, though results were unclear for IPs with fewer than 5 edits before being warned.
- When comparing just the shorter test template against the control, it was clearly superior at encouraging edits.
Warnings about external links
In addition to the test of spam warnings via Huggle, we experimented with new messages for XLinkBot, which warns users who insert inappropriate external links into articles. The results were inconclusive on a quantitative basis, with no significant retention-related effects for either registered or IP editors. One result we noted in this test was that the bot sends multiple warnings to IPs, which combine with Huggle/Twinkle-issued spam warnings and pile up on IP talk pages, making their effects hard to tease apart.
Though there was no statistically significant change in editor retention in the test, qualitative data suggests that:
- more boldly declaring that the reverting editor was a bot that sometimes makes mistakes may encourage editors to revert it more or otherwise contest its actions.
- the approach used for vandalism warnings in prior tests may not be effective in the case of editors adding external links that are obviously unencyclopedic in some way (even if they are not always actually spam); these complex issues require very specific feedback to editors.
Warnings about test edits
Warnings about files
In this test, we worked with English Wikipedia's ImageTaggingBot to test warnings delivered to editors who upload files that are missing some or all of the vital information about licensing and source. You can read more about the data analysis in our journal notes.
Warnings about copyright
In this test, we worked with English Wikipedia's CorenSearchBot to test different kinds of warnings to editors whose articles are very likely to be copied, in part or wholesale, from another website. CorenSearchBot relies on a Yahoo! API and is currently not in operation, but in its prime it was the English Wikipedia's number one identifier of copyright violations in new articles. Read more about our data analysis in our journal notes.
Deletion notifications
In this test, we attempted to rewrite the language of two of the three kinds of deletion notices used on English Wikipedia, to test whether clearer, friendlier language was better at retaining users and helping them properly contest a deletion.
We also asked for some qualitative feedback from the users who had been affected by this test and had continued to edit months later. For further reading on our data analysis, see our journal notes for Twinkle.
Articles for deletion (AFD)
The goal of the new version we wrote for the AFD notification was to gently explain what AFD was, and to encourage authors of articles to go and add their perspective to the discussion.
For a quantitative analysis, we looked at editing activity in the Wikipedia (or "project") namespace among editors in both the test and control group, and found no statistically significant difference in the amount of activity there.
Since AFD discussions occur in this namespace, we've concluded that the new template is likely no better at motivating editors to participate in deletion discussions. AFD is a complicated and somewhat intimidating venue for someone inexperienced with its norms, so it is not a surprise that it proved difficult to motivate editors to participate there.
However, qualitative feedback we received from new editors suggested that simply removing the excess links present in the current AFD notification was an improvement; the default includes links to a list of all policies and guidelines and to documentation on all deletion policies.
Proposed deletion (PROD)
Read about an editor from this test.
Since the primary method for objecting to deletion via a PROD is simply to remove the template, we used the API to search for edits by contributors in our test and control groups for this action (shortly after receiving the message, not over all time).
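The sketch below shows roughly how such a check can be done against the MediaWiki API in Python; it is illustrative only (the template-name matching is simplified, and deleted revisions are not visible to an anonymous query like this one), not our actual analysis code.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def removed_prod_tag(title, username):
    """Return True if `username` removed a {{proposed deletion}} tag from `title`,
    judged by comparing consecutive revision texts."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user|content",
        "rvslots": "main",
        "rvlimit": 50,
        "rvdir": "newer",       # oldest revisions first
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    revisions = next(iter(pages.values())).get("revisions", [])
    texts = [rev["slots"]["main"]["*"].lower() for rev in revisions]
    users = [rev.get("user") for rev in revisions]
    for before, after, editor in zip(texts, texts[1:], users[1:]):
        if "{{proposed deletion" in before and "{{proposed deletion" not in after:
            if editor == username:
                return True
    return False
```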
When looking at revisions to articles not deleted, there was no significant difference:
- 16 (2.8%) of the 557 editors who received the test message removed a PROD tag.
- 13 (2.2%) of the 579 editors who received the control message removed a PROD tag.
So while there was a small increase in the number of removed PROD templates in the test group, it was not statistically significant.
However, when examining deleted revisions (i.e. articles that were successfully deleted via any method), we see a stronger suggestion that the new template more effectively told editors how and why to remove a PROD template:
- 96 out of 550 (17%) users whose articles were deleted removed the PROD tag if they got the test message.
- 70 out of 550 users (13%) removed the PROD tag if they got the standard message we used as a control.
Note that the absolute numbers are also a lot higher than in non-deleted revisions (where roughly 2% of each group removed the tag). This makes sense: most new articles on Wikipedia get deleted in one way or another.
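For readers who want to reproduce the significance check on these counts, here is a quick sketch using the figures above; the choice of a chi-square test with continuity correction is ours for illustration, not necessarily the exact test used in the original analysis.

```python
from scipy.stats import chi2_contingency

def removal_p_value(removed_test, total_test, removed_control, total_control):
    """Chi-square test (with Yates correction) on a 2x2 table of PROD tag removals."""
    table = [[removed_test, total_test - removed_test],
             [removed_control, total_control - removed_control]]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

# Articles that were not deleted: 16/557 (test) vs. 13/579 (control)
print(removal_p_value(16, 557, 13, 579))   # p is far above 0.05: no significant difference
# Articles that were deleted: 96/550 (test) vs. 70/550 (control)
print(removal_p_value(96, 550, 70, 550))   # p falls below 0.05, supporting the stronger effect above
```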
These results match the qualitative feedback we received from Wikipedians about these templates: the new version was much clearer when it came to instructing new editors how to object to a PROD, but it did not prepare them for the likelihood that their article would be nominated for deletion via another method if the PROD failed to stick.
Speedy deletion (CSD)
SDPatrolBot is a bot that warns editors for removing a speedy deletion tag from an article they started. We hypothesized that the main reason this happens is not a bad-faith action on the part of the article author, but because the author is new to Wikipedia and doesn't understand how to properly contest a speedy deletion. To test this hypothesis, we created warnings that focused less on reproaching the user and more on inviting them to contest the deletion on the article's talk page.
The results showed that the new template did succeed in getting authors to edit the talk page rather than remove the speedy deletion tag, but only for those authors who had made many edits to their article before getting warned (i.e., those who had worked hard on the article and were probably more invested in contesting the deletion). One unexpected effect of the test, however, was that, overall, the new warning made users more likely to be warned again by SDPatrolBot – i.e., to remove the speedy deletion tag a second time.
When we asked for qualitative feedback about the template from the users who were still editing a month after the test, we noticed that some users were confused by the friendly tone of the template, in stark contrast to all the other deletion notices they had received. They were also confused by the fact that one deletion process (Proposed deletion) invites users to remove a deletion tag from the article, but the other two (Articles for deletion and Speedy deletion) warn them that they may be blocked for performing this action.
Read about an editor from this test.
Welcome messages
Portuguese Wikipedia
German Wikipedia
The German Wikipedia welcome test was designed to measure whether reducing the length and number of links in the standard welcome message would lead new users to edit more. Though the sample size was very small (about 90 users total), the results were encouraging:
- 64% of editors who received the standard welcome message went on to make at least one edit on German Wikipedia;
- 88% of those who received the same welcome message but with fewer links went on to edit;
- 84% of those who received a very short, informal welcome message went on to edit.
Overall, these results support our findings from other tests: Welcome messages that are relatively short, to the point, and which contain only a few important links are more effective at encouraging new editors to get involved in the project.
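As a rough check on how much weight such a small sample can bear, the sketch below runs Fisher's exact test on the first two groups, assuming roughly 30 users per group (the per-group sizes are our assumption; only the ~90-user total is reported above):

```python
from scipy.stats import fisher_exact

# Assumed group sizes (~30 each); roughly 64% vs. 88% of users going on to edit.
edited_control, n_control = 19, 30        # ~64%
edited_fewer_links, n_fewer_links = 26, 30  # ~88%

table = [[edited_control, n_control - edited_control],
         [edited_fewer_links, n_fewer_links - edited_fewer_links]]
odds_ratio, p = fisher_exact(table)
# With samples this small, p sits near the conventional 0.05 threshold,
# which is why we treat the result as encouraging rather than conclusive.
print(p)
```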
Wikimedia Incubator
Other talk page-related findings
Though we began this testing project thinking in terms of new users, it quickly became clear over the course of many of these tests – especially Twinkle, ImageTaggingBot, and CorenSearchBot – that user warnings and deletion notices affect a huge number of long-time Wikipedians, too. Yet despite the considerable difference in Wikipedia knowledge between a brand-new user and someone who's been editing for years, in most cases people get the same message regardless of their experience.
Credits
This study was run by Maryana Pinchuk and Steven Walling, with data analysis led by Ryan Faulkner. The testing methodology was created by Aaron Halfaker and R. Stuart Geiger, who also performed other key analyses. However, these tests would not have been possible without help from many Wikimedians, especially:
Future work
See also
- Screenshots of the test templates and controls, for comparison
- Examples of editors who were part of our tests