Research:Newsletter/2024/May: Difference between revisions
{{WRN|14|05|May 2024|ChatGPT did not kill Wikipedia, but might have reduced its growth}}'''By:''' [[:w:User:HaeB|Tilman Bayer]]
===Actually, Wikipedia was not killed by ChatGPT – but it might be growing a little less because of it===
A preprint<ref>{{Cite arXiv |eprint=2405.10205| last1 = Reeves| first1 = Neal| last2 = Yin| first2 = Wenjie| last3 = Simperl| first3 = Elena| title = Exploring the Impact of ChatGPT on Wikipedia Engagement| date = 2024-05-22| class = cs.HC}}</ref> by three researchers from King's College London tries to identify the impact of the November 2022 launch of [[w:ChatGPT|ChatGPT]] on "Wikipedia user metrics across four areas: page views, unique visitor numbers, edit counts and editor numbers within twelve language instances of Wikipedia." The analysis concludes that
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"any impact has been limited and while [ChatGPT] may have led to lower growth in engagement [i.e. Wikipedia pageviews] within the territories where it is available, there has been no significant drop in usage or editing behaviours"
</blockquote>
The authors note that there are good ''a priori'' reasons to hypothesize that ChatGPT may have replaced Wikipedia for some usages:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"At this time, there is limited published research which demonstrates how and why users have been engaging with ChatGPT, but early indications would suggest users are turning to it in place of other information gathering tools such as search engines [...]. Indeed, question answering, search and recommendation are key functionalities of large language models identified in within the literature [...]"
</blockquote>
However, like many other current concerns about AI, these have been speculative and anecdotal. Hence the value of a quantitative analysis that tries to identify the causal impact of ChatGPT on Wikipedia in a statistically rigorous manner. Without experiments, though, i.e. based on observational data alone, it is not easy to establish that a particular change or external event caused persistent increases or decreases in Wikipedia usage overall (as opposed to one-time spikes from particular events, or [[Research:Newsletter/2019/November#Seasonality_in_pageviews_reflects_plants_blooming_and_birds_migrating|recurring seasonal changes]]). The paper's literature review section cites only one previous publication which achieved that for Wikipedia pageviews: a 2019 paper by three authors from the Wikimedia Foundation (see our earlier coverage: [[Research:Newsletter/2019/December#An_awareness_campaign_in_India_did_not_affect_Wikipedia_pageviews,_but_a_new_software_feature_did|"An awareness campaign in India did not affect Wikipedia pageviews, but a new software feature did"]]). They had used a fairly sophisticated statistical approach ([[w:Bayesian structural time series|Bayesian structural time series]]) to first create a counterfactual forecast of Wikipedia traffic in a world where the event in question did not happen, and then interpret the difference between that forecast and the actual traffic as related to the event's impact. Their method successfully estimated the impact of a software change (consistent with the results of a previous randomized experiment conducted by this reviewer), as highlighted by the authors of the present paper: "Technological changes can [...] have significant and pervasive changes in user behaviour as demonstrated by the significant and persistent drop in pageviews observed in 2014 [sic, actually 2018] when Wikipedia introduced a page preview feature allowing desktop users to explore Wikipedia content without following links."
The WMF authors concluded their 2019 paper by expressing the hope that "it lays the groundwork for exploring more standardized methods of predicting trends such as page views on Wikipedia with the goal of understanding the effect of external events."
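The counterfactual logic described above can be illustrated with a deliberately simplified sketch. It is not the 2019 paper's method: an ordinary least-squares linear trend stands in for the Bayesian structural time series model, the data are synthetic, and all function names and numbers are hypothetical.

```python
def fit_linear_trend(ys):
    """Least-squares fit of y = intercept + slope * t for t = 0..len(ys)-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def estimated_effect(series, event_index):
    """Mean gap between actual post-event values and the pre-event trend forecast."""
    pre, post = series[:event_index], series[event_index:]
    intercept, slope = fit_linear_trend(pre)
    # Counterfactual: what the pre-event trend predicts for the post-event days.
    forecast = [intercept + slope * t for t in range(event_index, len(series))]
    gaps = [actual - expected for actual, expected in zip(post, forecast)]
    return sum(gaps) / len(gaps)


# Synthetic daily pageviews (made-up numbers): steady growth of 2 views per day,
# then a persistent drop of 50 views from day 100 onward.
series = [1000 + 2 * t - (50 if t >= 100 else 0) for t in range(150)]
effect = estimated_effect(series, event_index=100)
```

In this toy series the post-event average is still higher than the pre-event average, yet the gap to the forecast recovers the built-in drop. A real analysis would also need to model seasonality and forecast uncertainty, which this sketch ignores.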
In contrast, the present paper starts out with a fairly crude statistical method.
First,
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
We gathered data for twelve languages from the Wikipedia API covering a period of twenty two months between the 1st of January 2021 and the 1st of January 2024. This includes a period of approximately one year following the date on which ChatGPT was initially released on the 30th of November 2022.
</blockquote>
(The paper does not state which 22 months of the 36 months in that timespan were included.)
The 12 Wikipedia languages were
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"selected to ensure geographic diversity covering both the global north and south. When selecting languages, we looked at three key factors:
# The common crawl size of the [[w:GPT-3|GPT-3]] main training data as a proxy for the effectiveness of ChatGPT in that language.
# The number of Wikipedia articles in that language.
# The number of global first and second language speakers of that language.
We aimed to contrast languages with differing numbers of global speakers and languages with differing numbers of Wikipedia articles [...ending up with English, Urdu, Swahili, Arabic, Italian and Swedish].
As a comparison, we also analysed six languages selected from countries where ChatGPT is banned, restricted or otherwise unavailable [Amharic, Farsi, Russian, Tigrinya, Uzbek and Vietnamese].
</blockquote>
Then, "[a]s a first step to assess any impact from the release of ChatGPT, we performed paired statistical tests comparing aggregated statistics for each language for a period before and after release" (the paper leaves it unclear how long these periods were). E.g.
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"For page views, we first performed a two-sided [[w:Wilcoxon rank-sum test|Wilcoxon Rank Sum test]] to identify whether there was a difference between the two periods (regardless of
directionality). We found a statistically significant different for five of the six languages where ChatGPT was available and two of the six languages where it was not. However, when repeating this test with a one-sided test to identify if views in the period after release were lower than views in the period before release, we identified a statistically significant result in Swedish, but not for the remaining 11 languages."
</blockquote>
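How a two-sided test can come out significant while the one-sided "lower after release" test does not may be easier to see in a small sketch. The following is a minimal illustration using the normal approximation to the rank-sum statistic, on made-up numbers rather than the paper's data; the function names are hypothetical and ties are assumed absent.

```python
from statistics import NormalDist


def rank_sum_z(before, after):
    """z statistic of the Wilcoxon rank-sum test for `after` vs `before`.

    Uses the normal approximation and assumes no tied values.
    """
    pooled = sorted([(v, "before") for v in before] + [(v, "after") for v in after])
    # Sum of the 1-based ranks held by the `after` sample.
    w = sum(rank for rank, (_, label) in enumerate(pooled, start=1) if label == "after")
    n1, n2 = len(after), len(before)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    return (w - mean_w) / sd_w


def p_values(before, after):
    """p-values for the two-sided test and for both one-sided alternatives."""
    z = rank_sum_z(before, after)
    cdf = NormalDist().cdf
    return {
        "two_sided": 2 * (1 - cdf(abs(z))),
        "after_lower": cdf(z),        # alternative: views fell after release
        "after_higher": 1 - cdf(z),   # alternative: views rose after release
    }


# Made-up daily page views in which every "after" value exceeds every "before" value.
before = [1000 + 3 * d for d in range(30)]
after = [1200 + 3 * d for d in range(30)]
p = p_values(before, after)
```

With these inputs the two-sided p-value is tiny while the "after lower" p-value is close to 1. This pattern is what one would expect if views had risen rather than fallen, although the quoted passage does not report the direction for each language.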
For the other three metrics (unique devices, active editors, and edits) the results were similarly ambiguous, motivating the authors to resort to a somewhat more elaborate approach:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"While the [[w:Wilcoxon signed-rank test|Wilcoxon Signed-Rank test]]<!--sic--> provided weak evidence for changes among the languages before and after the release of ChatGPT, we note ambiguities in the findings and limited accounting for seasonality. To address this and better evaluate any impact, we performed a [[w:panel regression|panel regression]] using data for each of the four metrics. Additionally, to account for longer-term trends, we expanded our sample period to cover a period of three years with data from the 1st of January in 2021 to the 1st of January 2024."
</blockquote>
While this second method accounts for weekly and yearly seasonality, it too does not attempt to disentangle the impact of ChatGPT from ongoing longer-term trends. (While the given regression formula includes a language-specific [[w:fixed effect|fixed effect]], it has none for the availability of ChatGPT in that language, and no slope term either.) The usage of Wikipedia might well have been decreasing or increasing steadily during those three years for other reasons (say, the basic fact that every year, the number of Internet users worldwide [https://ourworldindata.org/grapher/number-of-internet-users?country=~OWID_WRL increases by hundreds of millions]). Indeed, a naive application of the method would yield the counter-intuitive conclusion that ChatGPT ''increased'' Wikipedia traffic in those languages where it was available:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"For all six languages, [using panel regression] we found a statistically significant difference in page views associated with whether ChatGPT had launched when controlling for day of the week and week of the year. In five of the six languages, this was a positive effect with Arabic featuring the most significant rise (18.3%) and Swedish featuring the least (10.2%). The only language where a fall was observed was Swahili, where page views fell by 8.5% according to our model. However, Swahili page viewing habits were much more sporadic and prone to outliers perhaps due to the low number of visits involved."
</blockquote>
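Why a regression without a slope term can attribute an ongoing trend to the launch can be shown with a toy example. The sketch below is not the paper's panel regression: it uses a single synthetic series, no seasonality terms, and a small hand-written least-squares solver, and every name and number in it is hypothetical.

```python
def ols(columns, y):
    """Ordinary least squares via the normal equations (Gauss-Jordan elimination)."""
    k = len(columns)
    # Build the augmented matrix [X'X | X'y].
    a = [
        [sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(ci * yi for ci, yi in zip(columns[i], y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


days = list(range(200))
launch = 100
# Synthetic log page views growing by 0.001 per day, with NO launch effect at all.
log_views = [10.0 + 0.001 * t for t in days]

ones = [1.0] * len(days)
post = [1.0 if t >= launch else 0.0 for t in days]
trend = [float(t) for t in days]

naive_effect = ols([ones, post], log_views)[1]            # no slope term
adjusted_effect = ols([ones, trend, post], log_views)[2]  # with slope term
```

Here the model without a slope term reports a post-launch coefficient of about 0.1 log points (roughly a 10.5% rise) although no effect was built into the data, whereas adding the linear trend brings the coefficient to essentially zero.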
To avoid this fallacy (and partially address the aforementioned lack of trend analysis), the authors apply the same method to their (so to speak) [[w:control group|control group]], i.e. "the six language versions of Wikipedia where ChatGPT is was unavailable":
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"Once again, results showed a statistically significant rise across five of the six languages. However, in contrast with the six languages where ChatGPT was available, these rises were generally much more significant. For Farsi, for example, our model showed a 30.3% rise, while for Uzbek and Vietnamese we found a 20.0% and 20.7% rise respectively. In fact, four of the languages showed higher rises than all of the languages where ChatGPT was available except Arabic, while one was higher than all languages except Arabic and Italian."
</blockquote>
The authors stop short of attempting to use this difference (between generally larger pageview increases in ChatGPT-less languages and generally smaller increases for those where ChatGPT was available) to quantify the overall effect of ChatGPT directly, perhaps because such an estimation would become rather statistically involved and require additional assumptions. In the paper's "Conclusion" section, they frame this finding in vague, qualitative terms instead, by stating that ChatGPT {{tq|may have led to lower growth in engagement [pageviews] within the territories where it is available}}.
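For readers wondering what such a direct quantification might look like in its simplest form, here is a back-of-the-envelope difference-in-differences sketch. It is not something the paper does, it relies on the strong assumption that both groups of languages would have grown at the same rate without ChatGPT, and the inputs are hypothetical round numbers rather than figures from the paper.

```python
import math


def relative_did(treated_change, control_change):
    """Difference-in-differences on the log scale, returned as a relative change.

    Both arguments are fractional changes (0.10 means +10%).
    """
    log_gap = math.log1p(treated_change) - math.log1p(control_change)
    return math.expm1(log_gap)


# Hypothetical inputs, not figures from the paper: +10% where the chatbot
# was available, +20% where it was not.
effect = relative_did(0.10, 0.20)
```

With these made-up inputs the implied effect is a relative decrease of about 8.3% (1.10 / 1.20 − 1). A serious estimate would additionally need uncertainty intervals and a check of the parallel-trends assumption.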
For the other three metrics studied (unique devices, active editors, and edits), the results appear to have been even less conclusive. E.g. for edits, "[p]anel regression results for the six languages were generally not statistically significant. Among the languages where a significant result was found, our model suggested a 23.7% rise in edits in Arabic, while for Urdu the model suggested a 21.8% fall."
In the "Conclusion" section, the authors summarize this as follows:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
Our findings suggest an increase in page visits and visitor numbers [i.e. page views and unique devices] that occurred across languages regardless of whether ChatGPT was available or not, although the observed increase was generally smaller in languages from countries where it was available. Conversely, we found little evidence of any impact for edits and editor numbers. We conclude any impact has been limited and while it may have led to lower growth in engagement within the territories where it is available, there has been no significant drop in usage or editing behaviours.
</blockquote>
Unfortunately, this preprint does not adhere to research best practices for providing replication data or code (let alone a [[w:preregistration|preregistration]]), making it impossible to e.g. check whether the analysis of pageviews included automated traffic by spiders etc. (the default setting in the Wikimedia Foundation's [https://wikimedia.org/api/rest_v1/#/Pageviews%20data/get_metrics_pageviews_aggregate__project___access___agent___granularity___start___end_ Pageviews API]), which would considerably impact the interpretation of the results. The paper itself notes that such an attempt was made for edits ("we tried to limit the impact of bots by requesting only contributions from users") but doesn't address the analogous question for pageviews.
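To make the distinction concrete, the sketch below builds request URLs for the aggregate pageviews endpoint linked above, once with and once without automated traffic. The path layout follows the endpoint name in that link; the parameter values (<code>all-agents</code>, <code>user</code>, <code>all-access</code>, <code>daily</code>, and the <code>YYYYMMDDHH</code> timestamps) reflect this reviewer's understanding of the API and should be checked against its documentation. The helper function, the chosen project and the date range are illustrative assumptions, not the paper's actual query.

```python
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"


def aggregate_pageviews_url(project, agent, start, end,
                            access="all-access", granularity="daily"):
    """Build a request URL for the aggregate pageviews endpoint.

    `agent` is the filter discussed above: "all-agents" includes traffic
    classified as spiders or automated, "user" excludes it.
    `start` and `end` are timestamps in YYYYMMDDHH format.
    """
    parts = [project, access, agent, granularity, start, end]
    return "/".join([API_ROOT] + [quote(p, safe="") for p in parts])


# Illustrative project and date range (assumptions, not taken from the paper).
all_traffic = aggregate_pageviews_url("sv.wikipedia.org", "all-agents", "2021010100", "2024010100")
human_only = aggregate_pageviews_url("sv.wikipedia.org", "user", "2021010100", "2024010100")
```

The two URLs differ only in the agent segment, which is why a published query or script would have settled the question.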
An earlier version of the paper as uploaded to arXiv had the title "'The Death of Wikipedia?' – Exploring the Impact of ChatGPT on Wikipedia Engagement", which was later shortened by removing the attention-grabbing "Death of Wikipedia". As explained in the paper itself, that term refers to "an anonymous Wikipedia editor's fears that generative AI tools may lead to the death of Wikipedia" – specifically, the essay [[:w:User:Barkeep49/Death of Wikipedia]], via its mention in a ''New York Times'' article, see [[:w:Wikipedia:Wikipedia Signpost/2023-08-01/In the media]]. While the paper's analysis conclusively disproves that Wikipedia has died as of May 2024, it is worth noting that Barkeep49 did not necessarily predict the kind of immediate, lasting drop that the paper's methodology was designed to measure. In fact, the aforementioned NYT article quoted him as saying (in July 2023) "It wouldn't surprise me if things are fine for the next three years [for Wikipedia] and then, all of a sudden, in Year 4 or 5, things drop off a cliff." Nevertheless, the paper's findings leave room for doubt as to whether this will be the first of the many [[w:predictions of the end of Wikipedia|predictions of the end of Wikipedia]] to come true.
===Briefly===
* See the [[mw:Wikimedia Research/Showcase|page of the monthly '''Wikimedia Research Showcase''']] for videos and slides of past presentations.
===Other recent publications===
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, [[
===="Do We Trust ChatGPT as much as Google Search and Wikipedia?"====
From the abstract:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"A focus group and interview study (N=14) revealed that thankfully not all users trust ChatGPT-generated information as much as Google Search and Wikipedia. It also shed light on the primary psychological considerations when trusting an online information source, namely perceived gatekeeping, and perceived information completeness. In addition, technological affordances such as interactivity and crowdsourcing were also found to be important for trust formation."
</blockquote>
From the paper:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"Among all three information sources, Google was the most trusted platform, favored by 57% of our participants, followed by Wikipedia, which was liked by 29% of our participants [...]. Four participants expressed that ChatGPT is less credible than Google because it does not disclose the original source of the information."
</blockquote>
It should be noted that the authors' relieved conclusion ("thankfully") is somewhat in contrast with the result of a larger scale blind experiment published last year in preprint form (see our coverage: "[[Research:Newsletter/2023/September#In_blind_test%2C_readers_prefer_ChatGPT_output_over_Wikipedia_articles_in_terms_of_clarity%2C_and_see_both_as_equally_credible|In blind test, readers prefer ChatGPT output over Wikipedia articles in terms of clarity, and see both as equally credible]]").
====WikiChat, "the first few-shot LLM-based chatbot that almost never hallucinates"====
[[File:All WikiChat components, and a sample conversation about an upcoming movie, edited for brevity.svg|thumb|center|650px|"All WikiChat components, and a sample conversation about an upcoming movie [Oppenheimer], edited for brevity. The steps taken to generate a response include (1) generating a query to retrieve from Wikipedia, (2) summarizing and filtering the retrieved passages, (3) generating a response from an LLM, (4) extracting claims from the LLM response (5) fact-checking the claims in the LLM response using retrieved evidence, (6) drafting a response, and (7) refining the response." (Figure 1 from the paper)]]
From the abstract of this paper (by three graduate students at Stanford University's computer science department and [[w:Monica S. Lam|Monica S. Lam]] as fourth author):<ref>{{Cite conference| publisher = Association for Computational Linguistics| doi = 10.18653/v1/2023.findings-emnlp.157| conference = [[w:EMNLP|EMNLP]] 2023| pages = 2387–2413| <!--editors = Houda Bouamor, Juan Pino, Kalika Bali (eds.)|--> last1 = Semnani| first1 = Sina| last2 = Yao| first2 = Violet| last3 = Zhang| first3 = Heidi| last4 = Lam| first4 = Monica| title = WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia| book-title = Findings of the Association for Computational Linguistics: EMNLP 2023| location = Singapore| date = December 2023| url = https://aclanthology.org/2023.findings-emnlp.157}} [https://github.com/stanford-oval/WikiChat Code]</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus. WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill WikiChat based on GPT-4 into a 7B-parameter [[w:LLaMA|LLaMA]] model with minimal loss of quality, to significantly improve its latency, cost and privacy, and facilitate research and deployment. [...] we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, and by 3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM. WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments."
</blockquote>
An online demo is available at https://wikichat.genie.stanford.edu/ . The [https://github.com/stanford-oval/WikiChat code] underlying the paper has been released under an open source license, and [https://github.com/stanford-oval/WikiChat?tab=readme-ov-file#run-a-distilled-model-for-lower-latency-and-cost two distilled models] (for running the chatbot locally without relying e.g. on OpenAI's API) have been published on Hugging Face.
See also our review of a previous (preprint) version of this paper: "[[Research:Newsletter/2023/July#Wikipedia-based_LLM_chatbot_%22outperforms_all_baselines%22_regarding_factual_accuracy|Wikipedia-based LLM chatbot 'outperforms all baselines' regarding factual accuracy]]"
===="A Simple Model of Knowledge Scaffolding Applied to Wikipedia Growth"====
From the abstract:<ref>{{Cite journal| doi = 10.3390/fi15020067| issn = 1999-5903| volume = 15| issue = 2| pages = 67| last1 = Bagnoli| first1 = Franco| last2 = de Bonfioli Cavalcabo’| first2 = Guido| title = A Simple Model of Knowledge Scaffolding Applied to Wikipedia Growth| journal = Future Internet| date = February 2023|
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"We illustrate a simple model of knowledge scaffolding, based on the process of building a corpus of knowledge, each item of which is linked to “previous” ones. [...]. Our model can be used as a rough approximation to the asymptotic growth of Wikipedia, and indeed, actual data show a certain resemblance with our model. Assuming that the user base is growing, at beginning, in an exponential way, one can also recover the early phases of Wikipedia growth."
</blockquote>
[[File:The fundamental knowledge scaffolding model.png|thumb|center|550px|"The fundamental knowledge scaffolding model. (left) Knowledge bits are represented as nodes of a network, where different colors represent different levels and nodes at a certain level only depend on a certain number of nodes at lower levels. Green (basic) nodes represent axioms. (right) Observing the filling of the network (here with fixed width W and with fixed number of dependencies K), one can detect holes [e.g. content gaps on Wikipedia] that are filled after the appearance of nodes at higher levels." (from the paper)]]
===="males outperform females" when navigating Wikipedia under time pressure====
From the abstract:<ref>{{Cite journal| doi = 10.1038/s41598-024-58305-2| issn = 2045-2322| volume = 14| issue = 1| pages = 8331| last1 = Zhu| first1 = Manran| last2 = Yasseri| first2 = Taha| last3 = Kertész| first3 = János| title = Individual differences in knowledge network navigation| journal = Scientific Reports| date = 2024-04-09
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"we conducted an online experiment where participants played a navigation game on Wikipedia and completed personal information questionnaires. Our analysis shows that age negatively affects knowledge space navigation performance, while multilingualism enhances it. Under time pressure, participants’ performance improves across trials and males outperform females, an effect not observed in games without time pressure. In our experiment, successful route-finding is usually not related to abilities of innovative exploration of routes."</blockquote>
Line 79 ⟶ 126:
From the abstract:<ref>{{Cite thesis| publisher = University of Southampton| last = Kaffee| first = Lucie-Aimée| title = Multilinguality in knowledge graphs| date = October 2021| url = https://eprints.soton.ac.uk/456783/}}</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"In this thesis, we present studies to assess and improve the state of labels and languages in knowledge graphs and apply multilingual information. We propose ways to use multilingual knowledge graphs to reduce gaps in coverage between languages. We explore the current state of language distribution in knowledge graphs by developing a framework
</blockquote>
''See also [[mw:Extension:ArticlePlaceholder]] and our coverage of a subsequent paper: [[
===="Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs" such as Wikidata====
From the abstract:<ref>{{Cite conference| publisher = Association for Computational Linguistics| doi = 10.18653/v1/2023.emnlp-main.100| conference = EMNLP 2023| pages = 1612–1634| editor = Houda Bouamor, Juan Pino, Kalika Bali
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"Recent work in Natural Language Processing and Computer Vision has been using textual information – e.g., entity names and descriptions – available in knowledge graphs [such as Wikidata] to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we [...] i) bring to light the problem of increasing multilingual coverage and precision of entity names and descriptions in Wikidata; ii) demonstrate that state-of-the-art methods, namely, Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs), struggle with this task; iii) present M-NTA, a novel unsupervised approach that combines MT, WS, and LLMs to generate high-quality textual information; and, iv) study the impact of increasing multilingual coverage and precision of non-English textual information in Entity Linking, Knowledge Graph Completion, and Question Answering. As part of our effort towards better multilingual knowledge graphs, we also introduce WikiKGE-10, the first human-curated benchmark to evaluate KGE approaches in 10 languages across 7 language families."
Line 98 ⟶ 146:
From the paper:<ref>Cedric Möller, Jens Lehmann, Ricardo Usbeck: [http://www.semantic-web-journal.net/content/survey-english-entity-linking-wikidata-0 Survey on English Entity Linking on Wikidata]. In: Semantic Web Journal, Special issue: Latest Advancements in Linguistic Linked Data, 2021; also as: {{cite arXiv | eprint = 2112.01989| last1 = Möller| first1 = Cedric| last2 = Lehmann| first2 = Jens| last3 = Usbeck| first3 = Ricardo| title = Survey on English Entity Linking on Wikidata| date = 2021-12-03 | class = cs.CL}} [https://github.com/semantic-systems/ELEnglishWD Code]</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"[[w:Entity Linking|Entity Linking]] (EL) is the task of connecting already marked mentions in an utterance to their corresponding entities in a knowledge graph (KG) [...]. In the past, this task was tackled by using popular knowledge bases such as DBpedia [67], Freebase [11] or Wikipedia. While the popularity of those is still imminent, another alternative, named Wikidata [120], appeared."
</blockquote>
From the abstract:
Line 108 ⟶ 156:
{{reflist|30em}}
{{WRN footer|14|05|May 2024|w:Wikipedia:Wikipedia Signpost/2024-06-08/Recent research}}