Research:Newsletter/2024/May
{{WRN}}
===Actually, Wikipedia was not killed by ChatGPT - but might it be growing a little less because of it===
A preprint<ref>{{Cite| publisher = arXiv| doi = 10.48550/arXiv.2405.10205| last1 = Reeves| first1 = Neal| last2 = Yin| first2 = Wenjie| last3 = Simperl| first3 = Elena| title = Exploring the Impact of ChatGPT on Wikipedia Engagement| date = 2024-05-22| url = http://arxiv.org/abs/2405.10205}}</ref> by three researchers from King’s College London tries to identify the impact of the November 2022 launch of [[ChatGPT]] on "Wikipedia user metrics across four areas: page views, unique visitor numbers, edit counts and editor numbers within twelve language instances of Wikipedia." The analysis concludes that
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"any impact has been limited and while [ChatGPT] may have led to lower growth in engagement [i.e. Wikipedia pageviews and unique devices] within the territories where it is available, there has been no significant drop in usage or editing behaviours"
</blockquote>
The authors note that there are good a priori reasons to hypothesize that ChatGPT may have replaced Wikipedia for some uses:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"At this time, there is limited published research which demonstrates how and why users have been engaging with ChatGPT, but early indications would suggest users are turning to it in place of other information gathering tools such as search engines [...]. Indeed, question answering, search and recommendation are key functionalities of large language models identified in within the literature [...]"
</blockquote>
However, like many other current concerns about AI, these have been speculative and anecdotal. Hence the value of a quantitative analysis that tries to identify the causal impact of ChatGPT on Wikipedia in a statistically rigorous manner. Without conducting experiments though, i.e. based on observational data alone, it is not easy to establish that a particular change or external event caused persistent increases or decreases in Wikipedia usage overall (as opposed to one-time spikes from particular events, or [[m:Research:Newsletter/2019/November#Seasonality_in_pageviews_reflects_plants_blooming_and_birds_migrating|recurring seasonal changes]]).

The paper's literature review section cites only one previous publication which achieved that for Wikipedia pageviews: a 2019 paper by three authors from the Wikimedia Foundation (see our earlier coverage: [[m:Research:Newsletter/2019/December#An_awareness_campaign_in_India_did_not_affect_Wikipedia_pageviews,_but_a_new_software_feature_did|"An awareness campaign in India did not affect Wikipedia pageviews, but a new software feature did"]]), which used a fairly sophisticated statistical approach ([[Bayesian structural time series]]) to first create a counterfactual forecast of Wikipedia traffic in a world where the event in question did not happen, and then interpret the difference between that forecast and the actual traffic as related to the event's impact. Their method successfully estimated the impact of a software change (consistent with the results of a previous randomized experiment conducted by this reviewer), as highlighted by the authors of the present paper: "Technological changes can [...] have significant and pervasive changes in user behaviour as demonstrated by the significant and persistent drop in pageviews observed in 2014 [sic, actually 2018] when Wikipedia introduced a page preview feature allowing desktop users to explore Wikipedia content without following links."
The WMF authors concluded their 2019 paper by expressing the hope that "it lays the groundwork for exploring more standardized methods of predicting trends such as page views on Wikipedia with the goal of understanding the effect of external events."
In contrast, the present paper starts out with a fairly crude statistical method.
First,
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
We gathered data for twelve languages from the Wikipedia API covering a period of twenty two months between the 1st of January 2021 and the 1st of January 2024. This includes a period of approximately one year following the date on which ChatGPT was initially released on the 30th of November 2022.
</blockquote>
(The paper does not state which 22 months of the 36 months in that timespan were included.)
The 12 Wikipedia languages were
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"selected to ensure geographic diversity covering both the global north and south. When selecting languages, we looked at three key factors:<br>
1. The common crawl size of the [[GPT-3]] main training data as a proxy for the effectiveness of ChatGPT in that language.<br>
2. The number of Wikipedia articles in that language.<br>
3. The number of global first and second language speakers of that language.<br>
We aimed to contrast languages with differing numbers of global speakers and languages with differing numbers of Wikipedia articles [...ending up with English, Urdu, Swahili, Arabic, Italian and Swedish].
As a comparison, we also analysed six languages selected from countries where ChatGPT is banned, restricted or otherwise unavailable [Amharic, Farsi, Russian, Tigrinya, Uzbek and Vietnamese].
</blockquote>
Then, "[a]s a first step to assess any impact from the release of ChatGPT, we performed paired statistical tests comparing aggregated statistics for each language for a period before and after release" (the paper leaves it unclear how long these periods were). E.g.
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"For page views, we first performed a two-sided [[Wilcoxon rank-sum test|Wilcoxon Rank Sum test]] to identify whether there was a difference between the two periods (regardless of
directionality). We found a statistically significant different for five of the six languages where ChatGPT was available and two of the six languages where it was not. However, when repeating this test with a one-sided test to identify if views in the period after release were lower than views in the period before release, we identified a statistically significant result in Swedish, but not for the remaining 11 languages."
</blockquote>
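The authors' code has not been released, but this first step can be sketched on synthetic numbers (all figures below are invented): SciPy's <code>mannwhitneyu</code> implements the Wilcoxon rank-sum test the paper names, in both two-sided and one-sided form.

```python
# Hedged sketch, not the authors' code: compare synthetic daily pageview
# counts before and after a cutoff with the Wilcoxon rank-sum test
# (which SciPy exposes as the equivalent Mann-Whitney U test).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
before = rng.normal(1_000_000, 50_000, size=180)  # daily pageviews, pre-release
after = rng.normal(980_000, 50_000, size=180)     # daily pageviews, post-release

# Two-sided test: is there *any* difference between the two periods?
_, p_two = mannwhitneyu(before, after, alternative="two-sided")

# One-sided test: are post-release views specifically *lower* than pre-release?
_, p_one = mannwhitneyu(after, before, alternative="less")

print(f"two-sided p={p_two:.4f}, one-sided (after < before) p={p_one:.4f}")
```

A significant two-sided result paired with a non-significant one-sided one — the pattern the authors report for most languages — means there was a detectable difference, but not a consistently downward one.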
For the other three metrics (unique users, active editors, and edits) the results were similarly ambiguous, motivating the authors to resort to a somewhat more elaborate approach:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"While the [[Wilcoxon signed-rank test|Wilcoxon Signed-Rank test]]<!--sic--> provided weak evidence for changes among the languages before and after the release of ChatGPT, we note ambiguities in the findings and limited accounting for seasonality. To address this and better evaluate any impact, we performed a [[panel regression]] using data for each of the four metrics. Additionally, to account for longer-term trends, we expanded our sample period to cover a period of three years with data from the 1st of January in 2021 to the 1st of January 2024."
</blockquote>
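Since neither code nor the full regression output is published, the following is only a rough, hypothetical sketch of a specification of this kind — log pageviews regressed on a post-launch dummy plus language fixed effects and day-of-week / week-of-year seasonality dummies — fit here on synthetic data (all column names and numbers are invented):

```python
# Hedged sketch of a panel regression of the kind the paper describes,
# on synthetic data with a 10% post-launch rise built in.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", "2024-01-01", freq="D")
rows = []
for lang in ["en", "sv"]:
    base = 1e6 if lang == "en" else 1e5
    for d in dates:
        post = int(d >= pd.Timestamp("2022-11-30"))  # ChatGPT release date
        views = base * (1 + 0.1 * post) * rng.lognormal(0, 0.05)
        rows.append({"lang": lang, "post_chatgpt": post, "views": views,
                     "dow": d.dayofweek, "woy": int(d.isocalendar().week)})
df = pd.DataFrame(rows)

# Language fixed effects plus day-of-week and week-of-year dummies;
# the post_chatgpt coefficient is the estimated log-change after launch.
model = smf.ols("np.log(views) ~ post_chatgpt + C(lang) + C(dow) + C(woy)",
                data=df).fit()
print(model.params["post_chatgpt"])  # ≈ 0.095 = log(1.1), the built-in 10% rise
```

Exponentiating the <code>post_chatgpt</code> coefficient gives the percentage change the model attributes to the post-launch period — which conflates ChatGPT with every other trend that differs across the cutoff.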
While this second method accounts for weekly and yearly seasonality, it too does not attempt to disentangle the impact of ChatGPT from ongoing longer term trends. (While the given regression formula includes a language-specific [[fixed effect]], it doesn't have one for the availability of ChatGPT in that language, and also no slope term.) The usage of Wikipedia might well have been decreasing or increasing steadily during those three years for other reasons (say the basic fact that every year, the number of Internet users worldwide [https://ourworldindata.org/grapher/number-of-internet-users?country=~OWID_WRL increases by hundreds of millions]). Indeed, a naive application of the method would yield the counter-intuitive conclusion that ChatGPT ''increased'' Wikipedia traffic in those languages where it was available:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
For all six languages, [using panel regression] we found a statistically significant difference in page views associated with whether ChatGPT had launched when controlling for day of the week and week of the year. In five of the six languages, this was a positive effect with Arabic featuring the most significant rise (18.3%) and Swedish featuring the least (10.2%). The only language where a fall was observed was Swahili, where page views fell by 8.5% according to our model. However, Swahili page viewing habits were much more sporadic and prone to outliers perhaps due to the low number of visits involved.
</blockquote>
To avoid this fallacy (and partially address the aforementioned lack of trend analysis), the authors apply the same method to their (so to speak) [[control group]], i.e. "the six language versions of Wikipedia where ChatGPT was unavailable":
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
Once again, results showed a statistically significant rise across five of the six languages. However, in contrast with the six languages where ChatGPT was available, these rises were generally much more significant. For Farsi, for example, our model showed a 30.3% rise, while for Uzbek and Vietnamese we found a 20.0% and 20.7% rise respectively. In fact, four of the languages showed higher rises than all of the languages where ChatGPT was available except Arabic, while one was higher than all languages except Arabic and Italian.
</blockquote>
The authors stop short of attempting to use this difference (between generally larger pageview increases in ChatGPT-less languages and generally smaller increases for those where ChatGPT was available) to quantify the overall effect of ChatGPT directly, perhaps because such an estimation would become rather statistically involved and require additional assumptions. In the paper's "Conclusion" section, they frame this finding in vague, qualitative terms instead, by stating that ChatGPT {{tq|may have led to lower growth in engagement [pageviews] within the territories where it is available}}.
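For what it is worth, the standard design for quantifying exactly this difference is a difference-in-differences regression — which the paper does not attempt. The sketch below, on synthetic data with invented languages and effect sizes, shows what such an estimate would look like:

```python
# Hypothetical difference-in-differences sketch (not in the paper):
# two "treated" languages (ChatGPT available) with a 10% post-launch rise,
# two "control" languages with a 30% rise. All numbers are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
dates = pd.date_range("2021-01-01", "2024-01-01", freq="D")
rows = []
for lang, treated, lift in [("en", 1, 0.10), ("sv", 1, 0.10),
                            ("fa", 0, 0.30), ("ru", 0, 0.30)]:
    for d in dates:
        post = int(d >= pd.Timestamp("2022-11-30"))
        views = 1e5 * (1 + lift * post) * rng.lognormal(0, 0.05)
        rows.append({"lang": lang, "treated": treated, "post": post,
                     "views": views})
df = pd.DataFrame(rows)

# Language fixed effects absorb the treated-group main effect; the
# treated:post interaction is the difference-in-differences estimate.
m = smf.ols("np.log(views) ~ C(lang) + post + treated:post", data=df).fit()
print(m.params["treated:post"])  # ≈ log(1.10) - log(1.30) ≈ -0.167
```

The interaction coefficient measures how much smaller the post-launch growth was where ChatGPT was available — but interpreting it causally requires the "parallel trends" assumption, one of the additional assumptions alluded to above.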
For the other three metrics studied (unique devices, active editors, and edits), the results appear to have been even less conclusive. E.g. for edits, "[p]anel regression results for the six languages were generally not statistically significant. Among the languages where a significant result was found, our model suggested a 23.7% rise in edits in Arabic, while for Urdu the model suggested a 21.8% fall."
In the "Conclusion" section, the authors summarize this as follows:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
Our findings suggest an increase in page visits and visitor numbers [i.e. page views and unique devices] that occurred across languages regardless of whether ChatGPT was available or not, although the observed increase was generally smaller in languages from countries where it was available. Conversely,
we found little evidence of any impact for edits and editor numbers. We conclude any impact has been limited and while it may have led to lower growth in engagement within the territories where it is available, there has been no significant drop in usage or editing behaviours.
</blockquote>
Unfortunately this preprint does not adhere to research best practices about providing replication data or code (let alone a [[preregistration]]), making it impossible to e.g. check whether the analysis of pageviews included automated traffic by spiders etc. (the default setting in the Wikimedia Foundation's [https://wikimedia.org/api/rest_v1/#/Pageviews%20data/get_metrics_pageviews_aggregate__project___access___agent___granularity___start___end_ Pageviews API]), which would considerably impact the interpretations of the results. The paper itself notes that such an attempt was made for edits ("we tried to limit the impact of bots by requesting only contributions from users") but doesn't address the analogous question for pageviews.
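Concretely, the setting in question is a single path segment of the Pageviews API request — <code>all-agents</code> (which includes spider and automated traffic) versus <code>user</code>. A small sketch of the two query variants (project and date range chosen arbitrarily):

```python
# Sketch of how the agent filter enters a Pageviews API call.
# "all-agents" includes spider and automated traffic; "user" excludes it.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"

def aggregate_url(project, agent, start, end,
                  access="all-access", granularity="daily"):
    """Build the aggregate-pageviews endpoint URL for one project/agent."""
    return f"{BASE}/{project}/{access}/{agent}/{granularity}/{start}/{end}"

# Same query twice, once with and once without bot/spider traffic:
url_all = aggregate_url("sv.wikipedia.org", "all-agents",
                        "2022110100", "2022113000")
url_user = aggregate_url("sv.wikipedia.org", "user",
                         "2022110100", "2022113000")
print(url_all)
print(url_user)
```

Fetching each URL (with a descriptive User-Agent header) returns JSON whose <code>items[].views</code> fields can be summed; the gap between the two totals is the non-human share that an analysis of unfiltered aggregate counts would silently include.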
An earlier version of the paper as uploaded to arXiv had the title "'The Death of Wikipedia?' -- Exploring the Impact of ChatGPT on Wikipedia Engagement", which was later shortened by removing the attention-grabbing "Death of Wikipedia". As explained in the paper itself, that term refers to "an anonymous Wikipedia editor’s fears that generative AI tools may lead to the death of Wikipedia" - specifically, the essay [[User:Barkeep49/Death of Wikipedia]], via its mention in a July 2023 New York Times article (see [[Wikipedia:Wikipedia Signpost/2023-08-01/In the media]]). While the paper's analysis conclusively disproves that Wikipedia had died as of May 2024, it is worth noting that Barkeep49 did not necessarily predict the kind of immediate, lasting drop that the paper's methodology was designed to measure. In fact, the aforementioned NYT article quoted him as saying (in July 2023): "It wouldn’t surprise me if things are fine for the next three years [for Wikipedia] and then, all of a sudden, in Year 4 or 5, things drop off a cliff." Nevertheless, the paper's findings give reason to doubt that this will be the first of the many [[predictions of the end of Wikipedia]] to come true.
===Briefly===
* See the [[mw:Wikimedia Research/Showcase|page of the monthly '''Wikimedia Research Showcase''']] for videos and slides of past presentations.
===Other recent publications===