User:Jeblad/Better support for imported data in wikitext/Motivation

Motivation for use of an NLP engine is an attempt to describe why an NLP engine is necessary to make a simple interface for including externally stored data in the content, and in particular how it can be done with Stanford CoreNLP. It is not an attempt to describe a complete system.

An example

Assume that a data value shall be inserted into a text. This value can come from anywhere, but in Wikipedia it will most likely come from Wikidata. It will be the value (object) of a statement, and as such it can carry several related statements. For now assume that it only has a single value, and as an example assume the subject is w:Bouvet Island – a small island in the South Atlantic. This place is currently uninhabited, but in the future some researchers might live there. For reference, the item for the island is Q1576537 and the property for population is P1082.

A text string like

The island has {{#plural:{{#property:P1082}}|0=no inhabitants|1=one inhabitant|{{#property:P1082}} inhabitants}}.

is then added in anticipation that the island might have some inhabitants in the future. A string like that is pretty unwieldy for ordinary editors, and it becomes even more unwieldy if a similar string is used in another article such as w:List of possessions of Norway, where it would be something like

Bouvet Island has {{#plural:{{#property:P1082|from=Q1576537}}|0=no inhabitants|1=one inhabitant|{{#property:P1082|from=Q1576537}} inhabitants}}.

This quickly becomes extremely cluttered, and the chance that editors will use this form is very small. We need a simpler way to identify both the predicate and the subject.

The predicate

The predicate can be identified in a much easier way by adding some kind of simple delimiter and using the natural names. For Wikidata these are nothing more than the labels and aliases. Then we have something like

The island has {{#plural:{population}|0=no inhabitants|1=one inhabitant|{population} inhabitants}}.

We can in fact do this already, as the property parser function can take a label instead of the property identifier. The form is slightly more readable, but it still creates problems for the average editor.
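
A minimal sketch in plain Lua of how such a label lookup could work, assuming a hand-made table standing in for the labels and aliases stored on Wikidata:

-- Hypothetical term table; a real implementation would build this from
-- the labels and aliases of the properties on Wikidata.
local terms = {
  ["population"] = "P1082",        -- label
  ["number of people"] = "P1082",  -- alias
}

-- Resolve a marker term to a property identifier.
local function resolveProperty(marker)
  return terms[string.lower(marker)]
end

print(resolveProperty("population"))        --> P1082
print(resolveProperty("Number of people"))  --> P1082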

Now assume that our marker text not only marks the predicate, but also fires a code snippet for the specific term, and that this code snippet can inspect neighboring words and terms. Then the reason for using the plural parser function goes away, and we can write

The island has {population} inhabitants.

because we can strip off the plural "s" from the following word. This will work most of the time, but the number of exceptions will be rather large.
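
A minimal sketch of such a snippet in plain Lua, assuming the snippet is handed the count and the word that follows the marker; the trivial strip-the-"s" rule is exactly the kind that accumulates exceptions:

-- Choose a surface form for the count and adjust the following word.
-- The singularization rule is deliberately naive; words like "people"
-- or "mice" would need a lexicon.
local function realize(count, nextWord)
  if count == 0 then
    return "no", nextWord
  elseif count == 1 then
    return "one", (nextWord:gsub("s$", ""))  -- strip the plural "s"
  end
  return tostring(count), nextWord
end

print(realize(0, "inhabitants"))  --> no  inhabitants
print(realize(1, "inhabitants"))  --> one  inhabitant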

A problem would be that the same code snippet will run whenever {population} is found. That is rather awkward, as we want to avoid branching and other kinds of exceptions in our code. Assume instead that the text is tagged, and that the tags surrounding our markers are used in the lookup of the code snippets to run. There are several ways to do such a lookup, but as there can be several matches they must be ranked according to a language model. One such match could be for "inhabitants", or more accurately the tag attached to "inhabitants".
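
A minimal sketch of such a ranked lookup, with invented snippets, tags and scores; the language model is stubbed out as a fixed score per entry:

-- Candidate snippets keyed on the marker and the tag of a neighboring
-- word; "NNS" is the part-of-speech tag for a plural noun.
local snippets = {
  { marker = "population", tag = "NNS", score = 0.9,
    run = function() return "count plus plural noun" end },
  { marker = "population", tag = "*", score = 0.1,
    run = function() return "bare count" end },
}

-- Return the highest ranked snippet matching the marker and tag.
local function pickSnippet(marker, neighborTag)
  local best
  for _, s in ipairs(snippets) do
    if s.marker == marker and (s.tag == neighborTag or s.tag == "*") then
      if not best or s.score > best.score then best = s end
    end
  end
  return best
end

-- "inhabitants" is tagged NNS, so the plural-aware snippet wins.
print(pickSnippet("population", "NNS").run())  --> count plus plural noun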

To actually change "inhabitants" we should not edit the text directly, but instead add new tags to the word or phrase. Those added tags are handled in a separate pass that runs after the Lua code. A word (or phrase) tagged as plural during analysis can then be marked as singular before synthesis (realization), and the form

The island has one inhabitant.

is used if there is a single researcher on the island.
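
A minimal sketch of the two-pass idea, with a made-up token structure; the snippet only flips a tag, and the later realization pass rewrites the word:

-- Pass 0 (analysis): the engine has tagged the token as plural.
local tokens = { { text = "inhabitants", tags = { number = "plural" } } }

-- Pass 1 (Lua): the snippet does not edit the text, it retags the token.
tokens[1].tags.number = "singular"

-- Pass 2 (after all Lua has run): realize the surface forms.
for _, t in ipairs(tokens) do
  if t.tags.number == "singular" then
    t.text = (t.text:gsub("s$", ""))  -- naive realization rule
  end
end

print(tokens[1].text)  --> inhabitant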

The subject

If we try to refer to a statement from some other article, the present syntax makes the text really cluttered, mostly because the subject must be identified over and over again. If we could utilize something like a referrer chain, that is, call a utility function from our code snippet that inspects the chain of referring expressions, we could use it to get to the final entity. In our example

Bouvet Island has {population} inhabitants.

the "Bouvet Island" will be identified as the entity, and population will then be extracted from this entity.

In Stanford CoreNLP there is a subsystem for OpenIE, and this subsystem builds referrer lists. The frame for our called code snippet will then hold a reference to the present position in the text, and calling the convenience function with this frame information would start a traversal of the referrer list. There are several ways to do this, and it is not quite clear which one is best.

Normal use of the convenience function will simply return the highest ranked entity, given a language model for the referrer, but the function could also take a predicate that must exist, and try to filter for an entity that carries this predicate in a statement. For this to work properly there should also be some kind of limit on how likely the entity must be, otherwise the search for a match could be unbounded. A related problem is if and how sentence boundaries should be crossed.
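
A minimal sketch of that variant, extending the traversal above with a required predicate and a likelihood limit; the candidates, scores and threshold are all invented:

-- Candidates assumed sorted by descending language model score.
local candidates = {
  { entity = "Q1576537", score = 0.8, properties = { P1082 = true } },
  { entity = "Q20",      score = 0.3, properties = {} },
}

-- Return the best candidate that carries the predicate and is likely
-- enough; stop early so the search cannot become unbounded.
local function resolveSubject(candidates, predicate, threshold)
  for _, c in ipairs(candidates) do
    if c.score < threshold then break end
    if c.properties[predicate] then return c.entity end
  end
end

print(resolveSubject(candidates, "P1082", 0.5))  --> Q1576537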

Aliases

If the found entities are ranked according to some language model, then conflicts between aliases can be resolved, and we can use aliases in addition to labels. That can make the text flow much more naturally. A text like

Bouvet Island has {population} inhabitants.

reads easily, but in some contexts it is perhaps more readable to have something like "The population of Bouvet Island is now {inhabitants}." If we read a sentence and stumble across a term used both as a kind of constant and as a sort of variable, it can be very confusing. Make it as simple for the editors as possible.
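
A minimal sketch of resolving a conflicting alias by rank, with invented candidates; "Pxxxx" is a placeholder for some competing property that shares the alias:

-- Several properties can share an alias; rank decides which one is meant.
local aliasIndex = {
  ["inhabitants"] = {
    { property = "P1082", score = 0.7 },  -- population
    { property = "Pxxxx", score = 0.2 },  -- placeholder for a competing sense
  },
}

-- Return the highest ranked property for the alias.
local function resolveAlias(alias)
  local best
  for _, c in ipairs(aliasIndex[alias] or {}) do
    if not best or c.score > best.score then best = c end
  end
  return best and best.property
end

print(resolveAlias("inhabitants"))  --> P1082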

Possible problems

Dirty text fragments

The generated text will not be clean, as the proposed solution would make it possible for the code snippets to change text outside the markers' limits; that is, the generated or changed content is not confined to the markers themselves. This can be counteracted by limiting the scope to a single paragraph, or to the content of the enclosing element, although that will not completely solve the problem.

It is not clear whether this will escalate the problem with unbalanced tags inside templates.

Deferred edits

The text changes should not be done in such a way as to allow cross-talk between code snippets, as cross-talk makes caching a lot more difficult. To avoid this, the changes should go into a separate list of deferred edits, and this list is only evaluated after all code snippets have run. If the deferred list contains conflicting changes, the text should be marked for further analysis.
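
A minimal sketch of such a deferred list in plain Lua; the span notation and the conflict rule are assumptions:

-- Snippets append edits instead of touching the text directly.
local deferred = {}

local function addEdit(span, newText)
  deferred[#deferred + 1] = { span = span, newText = newText }
end

-- Evaluate the list in one pass; conflicting edits on the same span
-- mark the text for further analysis.
local function applyEdits()
  local seen = {}
  for _, e in ipairs(deferred) do
    if seen[e.span] and seen[e.span] ~= e.newText then
      return nil, "conflict on span " .. e.span
    end
    seen[e.span] = e.newText
  end
  return seen
end

addEdit("3:14", "inhabitant")
addEdit("3:14", "inhabitants")  -- a second snippet disagrees
print(select(2, applyEdits()))  --> conflict on span 3:14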

Inflection of marker text

To make the text read more easily, the terms used for the markers should perhaps be inflected properly. This can interfere with fast lookup; that is, the marker text must be reduced to the same base form that is used on, for example, Wikidata.
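
A minimal sketch of that reduction, with a hand-written lemma table standing in for the NLP engine's lemmatizer:

-- Map inflected marker text to the base form used for lookup.
local lemmas = {
  ["inhabitants"] = "inhabitant",
  ["populations"] = "population",
}

local function baseForm(word)
  return lemmas[word] or word
end

print(baseForm("inhabitants"))  --> inhabitant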

Notes

It might not be clear that there are two shadow structures. One is the parse structure as it is generated before the Lua code is run, and the other is a list of edit operations that should be applied to the text after the marker text itself has been replaced.

Risks
  • It can be too difficult to stop cross-talk between two invocations of code snippets that hit the same text. The reason is that it can be necessary for them to agree on the final outcome. So far it seems like an option to create special rewrite rules for the deferred edits.
  • The load could easily become unmanageable. Lua scripts can be terminated as usual, but it is not clear that queries to the nlp-server can be throttled.
  • Security problems would mostly stem from requests to the nlp-server, as those requests will be created by the editors through the provided text.