Wikidata/Notes/Change propagation/es
== Summary ==
<div lang="en" dir="ltr" class="mw-content-ltr">
* Each change on the repository is recorded in the changes table which acts as an update feed to any client wikis (like the Wikipedias).
* Dispatcher scripts periodically check the changes table.
* Each client wiki is notified of any changes on the repository via an entry in its job queue. These jobs are used to invalidate and re-render the relevant page(s) on the client wiki.
* Notifications about the changes are injected into the client's recentchanges table, to make them visible on watchlists, etc.
* Consecutive edits by the same user to the same data item can be combined into one, to avoid clutter.
</div>
<span id="Assumptions_and_Terminology"></span>
== Assumptions and Terminology ==
<div lang="en" dir="ltr" class="mw-content-ltr">
The data managed by the Wikibase repository is structured into (data) entities. Every entity is
maintained as a wiki page containing structured data. There are several types of entities,
but one is particularly important in this context: items. Items are special in that they are
linked with article pages on each client wiki (e.g., each Wikipedia). For more information, see the
[[Wikidata/Notes/Data_model_primer|data model primer]].
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The propagation mechanism is based on the assumption that each data item on the Wikidata
repository has at most one site link to each client wiki, and that only one item on the repository
can link to any given page on a given client wiki. That is, any page on any client wiki can be
associated with at most one data item on the repository.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
: (See comment on discussion page about consequences of limiting change propagation to cases where Wikipedia page and Wikidata item have a 1:1 relation)
</div>
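To illustrate the assumption, here is a minimal Python sketch (with made-up data, not the actual Wikibase schema): each item holds at most one sitelink per client wiki, and each client page resolves to at most one item.

<syntaxhighlight lang="python">
# Minimal illustration of the sitelink assumption. The data and helper below
# are invented for this sketch; they are not the actual Wikibase schema.

sitelinks = {
    # (client wiki, page title) -> item ID; each key maps to at most one item.
    ("enwiki", "Berlin"): "Q64",
    ("dewiki", "Berlin"): "Q64",
}

def item_for_page(site: str, title: str) -> str | None:
    """Return the single item associated with a client page, if any."""
    return sitelinks.get((site, title))

assert item_for_page("enwiki", "Berlin") == "Q64"
assert item_for_page("enwiki", "No such page") is None
</syntaxhighlight>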
<div lang="en" dir="ltr" class="mw-content-ltr">
This mechanism also assumes that all wikis, the repository and the clients (i.e. Wikidata and the Wikipedias), can connect directly to each other's databases. Typically, this means that they reside in the same local network. However, the wikis may use separate database servers: wikis are grouped into sections, where each section has one master database and potentially many slave databases (together forming a database cluster).
</div>
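As a rough, hypothetical illustration of that layout (section labels and server names are invented here), the wiki-to-cluster mapping could look like this:

<syntaxhighlight lang="python">
# Hypothetical sketch of wikis grouped into database sections (clusters),
# each with one master and any number of replica servers.

db_sections = {
    "s1": {"master": "db-master-1", "replicas": ["db-replica-1a", "db-replica-1b"]},
    "s5": {"master": "db-master-5", "replicas": ["db-replica-5a"]},
}

wiki_to_section = {
    "repo":   "s5",  # the Wikibase repository (e.g. wikidata.org)
    "enwiki": "s1",  # a client wiki
}

def master_for(wiki: str) -> str:
    """Database server that handles writes for the given wiki."""
    return db_sections[wiki_to_section[wiki]]["master"]
</syntaxhighlight>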
<div lang="en" dir="ltr" class="mw-content-ltr">
Communication between the repository (Wikidata) and the clients (Wikipedias) is done via
an update feed. For now, this is implemented as a database table (the changes table)
which is accessed by the dispatcher scripts directly, using the "foreign database" mechanism.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
Support for third-party clients, that is, client wikis and other consumers outside of Wikimedia,
is not essential at this point and will not be implemented for now. It shall, however, be kept in
mind in all design decisions.
</div>
<span id="Change_Logging"></span>
== Change Logging ==
<div lang="en" dir="ltr" class="mw-content-ltr">
Every change performed on the repository is logged into a table (the "changes table", namely
wb_changes) in the repo's database. The changes table behaves similarly to MediaWiki's
recentchanges table, in that it only holds changes for a certain time (e.g. a day or a week); older
entries get purged periodically. Unlike the recentchanges table, however, wb_changes contains
all information necessary to report and replay the change on a client wiki: besides information
about when the change was made and by whom, it contains a structural diff against the entity's
previous revision.
</div>
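As a sketch of what one such entry carries (field names here are indicative; the actual wb_changes columns may be named differently):

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Change:
    """One row of the repository's changes table (wb_changes), as described
    above. Field names are indicative; the real column names may differ."""
    change_id: int      # monotonically increasing ID, used as a cursor by dispatchers
    entity_id: str      # the changed entity, e.g. "Q64"
    change_type: str    # e.g. creation / update / removal of an item
    timestamp: str      # when the change was made
    user_id: int        # who made the change
    revision_id: int    # resulting revision of the entity page on the repository
    diff: dict          # structural diff against the entity's previous revision
</syntaxhighlight>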
<div lang="en" dir="ltr" class="mw-content-ltr">
Effectively, the changes table acts as an update feed. Care shall be taken to keep the database
table an implementation detail behind the update feed abstraction, so it can later be replaced by an
alternative mechanism, such as PubHub or an event bus. Note, however, that a protocol
with queue semantics is not appropriate (it would require one queue per client).
</div>
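One way to keep the table an implementation detail is a thin feed interface. The sketch below assumes hypothetical connection and query helpers; because every client reads at its own pace (tracked by a per-client cursor), the feed is pull-based rather than a queue.

<syntaxhighlight lang="python">
from abc import ABC, abstractmethod

class ChangeFeed(ABC):
    """Abstraction over the update feed, so the wb_changes table stays an
    implementation detail and could later be swapped for another mechanism.
    This interface is hypothetical, not the actual Wikibase API."""

    @abstractmethod
    def fetch_since(self, change_id: int, limit: int) -> list[dict]:
        """Return up to `limit` changes with IDs greater than `change_id`,
        ordered by ID. Each client keeps its own cursor, so no queue
        semantics are needed."""

class ForeignDatabaseChangeFeed(ChangeFeed):
    """Reads wb_changes directly from the repository's database."""

    def __init__(self, repo_db):
        self.repo_db = repo_db   # assumed DB handle with a query() method

    def fetch_since(self, change_id: int, limit: int) -> list[dict]:
        return self.repo_db.query(
            "SELECT * FROM wb_changes WHERE change_id > %s "
            "ORDER BY change_id LIMIT %s",
            (change_id, limit),
        )
</syntaxhighlight>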
<span id="Dispatching_Changes"></span>
== Dispatching Changes ==
<div lang="en" dir="ltr" class="mw-content-ltr">
Changes on the repository (e.g. wikidata.org) are dispatched to client wikis (e.g. Wikipedias) by
a dispatcher script. This script polls the repository's wb_changes table for changes, and dispatches
them to the client wikis by posting the appropriate jobs to each client's job queue.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The dispatcher script is designed in a way that allows any number of instances to run and share load without
any prior knowledge of each other. They are coordinated via the repository's database using the
wb_changes_dispatch table:
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
* chd_client: the client's database name (primary key).
* chd_latest_change: the ID of the last change that was dispatched to the client.
* chd_touched: a timestamp indicating when updates have last been dispatched to the client.
* chd_lock_name: the name of the global lock used by the dispatcher currently updating that client (or NULL).
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The dispatcher operates by going through the following steps:
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
# Lock and initialize
## Choose a client to update from the list of known clients.
## Start DB transaction on repo's master database.
## Read the given client's row from wb_changes_dispatch (if missing, assume chd_latest_change = 0).
## If chd_lock_name is not null, call IS_FREE_LOCK(chd_lock_name) on the ''client's'' master database.
## If that returns 0, another dispatcher is holding the lock. Exit (or try another client).
## Otherwise, generate a fresh lock name and call GET_LOCK() with it on the ''client's'' master database to acquire the global lock.
## Update the client's row in wb_changes_dispatch with the new lock name in chd_lock_name.
## Commit DB transaction on repo's master database.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
# Perform the dispatch
## Get n changes with IDs > chd_latest_change from wb_changes in the repo's database. n is the configured batch size.
## Filter changes for those relevant to this client wiki (optional, and may prove tricky in complex cases, e.g. cached queries).
## Post the corresponding [[#Changes Notification Jobs|change notification jobs]] to the client wiki's job queue.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
# Log and unlock
## Start DB transaction on repo's master database.
## Update the client's row in wb_changes_dispatch with chd_lock_name=NULL and updated chd_latest_change and chd_touched.
## Call RELEASE_LOCK() to release the global lock we were holding.
## Commit DB transaction on repo's master database.
</div>
This can be repeated several times by one process, with a configurable delay between runs.
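Putting the three phases together, here is a rough Python sketch of a single dispatcher pass. The database handles (repo_db, client_db), their query methods, and the post_jobs callback are assumed stand-ins rather than actual MediaWiki APIs; the SQL and column names follow the description above.

<syntaxhighlight lang="python">
import time
import uuid

BATCH_SIZE = 100  # configured batch size (n)

def now() -> str:
    """Current time as a MediaWiki-style timestamp."""
    return time.strftime("%Y%m%d%H%M%S", time.gmtime())

def is_relevant(change: dict, client: str) -> bool:
    """Optional per-client filtering of changes; kept trivial in this sketch."""
    return True

def dispatch_one_pass(repo_db, client_db, client: str, post_jobs) -> bool:
    # 1. Lock and initialize
    repo_db.begin()
    state = repo_db.query_row(
        "SELECT * FROM wb_changes_dispatch WHERE chd_client = %s", (client,)
    ) or {"chd_latest_change": 0, "chd_lock_name": None}

    if state["chd_lock_name"] is not None:
        if client_db.query_value("SELECT IS_FREE_LOCK(%s)", (state["chd_lock_name"],)) == 0:
            repo_db.rollback()          # another dispatcher is holding the lock
            return False

    lock_name = "wikibase.dispatch." + uuid.uuid4().hex
    if client_db.query_value("SELECT GET_LOCK(%s, 0)", (lock_name,)) != 1:
        repo_db.rollback()
        return False
    repo_db.execute(
        "REPLACE INTO wb_changes_dispatch"
        " (chd_client, chd_latest_change, chd_touched, chd_lock_name)"
        " VALUES (%s, %s, %s, %s)",
        (client, state["chd_latest_change"], now(), lock_name),
    )
    repo_db.commit()

    # 2. Perform the dispatch
    changes = repo_db.query(
        "SELECT * FROM wb_changes WHERE change_id > %s ORDER BY change_id LIMIT %s",
        (state["chd_latest_change"], BATCH_SIZE),
    )
    post_jobs(client, [c for c in changes if is_relevant(c, client)])

    # 3. Log and unlock
    latest = changes[-1]["change_id"] if changes else state["chd_latest_change"]
    repo_db.begin()
    repo_db.execute(
        "UPDATE wb_changes_dispatch SET chd_latest_change = %s, chd_touched = %s,"
        " chd_lock_name = NULL WHERE chd_client = %s",
        (latest, now(), client),
    )
    client_db.query_value("SELECT RELEASE_LOCK(%s)", (lock_name,))
    repo_db.commit()
    return True

# A long-running dispatcher would repeat such passes with a configurable delay:
#   while True:
#       dispatch_one_pass(repo_db, client_db, pick_client(), post_jobs)
#       time.sleep(dispatch_interval)
</syntaxhighlight>

As described in the steps above, the global lock is checked and released on the client's master database, so two dispatchers never update the same client concurrently.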
== Changes Notification Jobs ==
<div lang="en" dir="ltr" class="mw-content-ltr">
The dispatcher posts change notification jobs to the client wiki's job queue. These jobs contain a list
of Wikidata changes. When processing such a job, the client wiki performs the following steps (a sketch of such a handler follows below):
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
# If the client maintains a local cache of entity data, update it.
# Find which pages need to be re-rendered after the change. Invalidate them and purge them from the web caches. Optionally, schedule re-render (or link update) jobs, or even re-render the page directly.
# Find which pages have changes that do not need re-rendering of content, but influence the page output, and thus need purging of the web caches (this may at some point be the case for changes to language links).
# Inject notifications about relevant changes into the client's recentchanges table. For this, consecutive edits by the same user to the same item can be [[#Coalescing Events|coalesced]].
# Possibly also inject a "null-entry" into the respective pages' history, i.e. the revision table.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
: (See comment on discussion page about recentchanges versus history table)
</div>
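A sketch of such a job handler, under the assumptions above; all helper objects and methods here (entity_cache, pages_using, purge_web_cache, recent_changes, and so on) are hypothetical placeholders, not actual MediaWiki interfaces.

<syntaxhighlight lang="python">
# Sketch of processing one change notification job on a client wiki.
# Every helper used on `client` is a hypothetical stand-in for client-side
# facilities described in the steps above.

def coalesce(changes: list[dict]) -> list[dict]:
    """Placeholder; see the coalescing sketch in the next section."""
    return changes

def external_change_entry(change: dict) -> dict:
    """Build a recentchanges-style record describing a repository change."""
    return {"source": "repository", "user": change["user_id"], "info": change["diff"]}

def run_change_notification_job(client, changes: list[dict]) -> None:
    # 1. Update the client's local cache of entity data, if it keeps one.
    for change in changes:
        client.entity_cache.update(change["entity_id"], change["diff"])

    # 2./3. Invalidate pages whose rendering depends on the changed entities
    #       and purge them from the web caches.
    for change in changes:
        for page in client.pages_using(change["entity_id"]):
            client.invalidate(page)        # or schedule a re-render / link update job
            client.purge_web_cache(page)

    # 4. Inject (coalesced) notifications into the recentchanges table.
    for change in coalesce(changes):
        client.recent_changes.insert(external_change_entry(change))
    # 5. Optionally, a "null" revision could also be added to each page's history.
</syntaxhighlight>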
<div lang="en" dir="ltr" class="mw-content-ltr">
== Coalescing Events ==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The system described above means several database writes for every change - and potentially many reads, depending on what is needed for rendering the page. And this happens on every client wiki (potentially hundreds) for every change on the repository. Since edits on the Wikibase repository tend to be very fine grained (like setting a label or adding a site link), this can quickly get problematic. Coalescing updates could help with this problem:
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
As explained in the [[#Dispatching Changes|Dispatching]] section, entries on the changes feed are processed in batches (by
default, no more than 100 entries at once).
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
If multiple changes to the same item are processed in the same batch, these changes can be coalesced together if they were all performed consecutively by the same user. This would reduce the number of times pages get invalidated (and thus eventually re-rendered). All the necessary entries in the recentchanges table (and possibly the revision table) can be inserted using a single database request. This process can be fine-tuned by adjusting the batch size and the delay between batches.
</div>
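A minimal sketch of that coalescing step, operating on change records as sketched earlier; combining the structural diffs is only stubbed out here.

<syntaxhighlight lang="python">
from itertools import groupby

def coalesce(changes: list[dict]) -> list[dict]:
    """Merge runs of consecutive changes made by the same user to the same item.
    `changes` is assumed to be ordered by change ID."""
    merged_changes = []
    for _key, run in groupby(changes, key=lambda c: (c["entity_id"], c["user_id"])):
        run = list(run)
        merged = dict(run[-1])                 # keep the metadata of the latest change
        merged["diff"] = combine_diffs([c["diff"] for c in run])
        merged_changes.append(merged)
    return merged_changes

def combine_diffs(diffs: list[dict]) -> dict:
    """Stub: a real implementation would compose the structural diffs so the
    result describes the overall change from the first to the last revision."""
    combined: dict = {}
    for diff in diffs:
        combined.update(diff)
    return combined
</syntaxhighlight>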