m JorisvS moved page Native Language Identification to Native-language identification |
Weas3l5491 (talk | contribs) mNo edit summary |
{{Short description|Determining someone's first language based on how they write or speak a different language}}
'''Native-language identification''' ('''NLI''') is the task of determining an author's [[first language|native language]] (L1) based only on their writings in a
== Overview ==
NLI works under the assumption that an author's L1 will dispose them towards particular language production patterns in their L2, as influenced by their
Using large-scale English data, NLI methods achieve over 80% accuracy in predicting the
==Applications==
===Pedagogy and
This identification of L1-specific features has been used to study [[language transfer]] effects in
===Forensic
NLI methods can also be applied in [[
This is particularly useful when a text, such as an anonymous letter, is the key piece of evidence in an investigation and clues about the writer's native language can help investigators identify the source.
This application has already attracted interest and funding from intelligence agencies.<ref>Ria Perkins. 2014. "Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis". Ph.D. thesis, Aston University.</ref>
== Methodology ==
[[Natural
A range of ensemble-based systems has also been applied to the task and shown to improve performance over single-classifier systems.<ref>Malmasi, Shervin, Sze-Meng Jojo Wong, and Mark Dras. [http://anthology.aclweb.org/W/W13/W13-1716.pdf "NLI Shared Task 2013: MQ submission"]. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. 2013.</ref><ref>Habic, Vuk, Semenov, Alexander, and Pasiliao, Eduardo. [https://www.sciencedirect.com/science/article/abs/pii/S0950705120305694 "Multitask deep learning for native language identification"] in Knowledge-Based Systems, 2020</ref>
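As an illustration of how an ensemble combines single classifiers, the following minimal sketch uses plain majority voting. Majority voting is only one possible combination scheme and is not claimed to be the method used in the cited systems; the function names <code>majority_vote</code> and <code>ensemble_predict</code> are hypothetical, and each base classifier is assumed to be a function that maps a text to an L1 label.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most base classifiers.

    Ties are broken in favour of the tied label that appears
    earliest in the list of predictions.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(classifiers, text):
    """Ask every base classifier for a label, then combine by majority vote.

    `classifiers` is assumed to be a list of callables: text -> label.
    """
    return majority_vote([classify(text) for classify in classifiers])
```

The ensemble can only outvote an individual classifier's mistake when the other base classifiers make different errors, which is why such systems are typically built from classifiers trained on different feature types.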
Various linguistic feature types have been applied to this task. These include syntactic features such as constituent parses, grammatical dependencies and part-of-speech tags.
Surface-level lexical features such as character, word and lemma [[n-gram]]s have also been found useful for this task. Character n-grams<ref>Radu Tudor Ionescu, Marius Popescu and Aoife Cahill. [http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00256 "String Kernels for Native Language Identification: Insights from Behind the Curtains"], Computational Linguistics, 2016</ref><ref>Radu Tudor Ionescu and Marius Popescu. [https://arxiv.org/abs/1707.08349 "Can string kernels pass the test of time in Native Language Identification?"], In Proceedings of BEA12, 2017.</ref> in particular appear to be the single best feature for the task.
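The character n-gram features described above can be sketched as follows. This is a toy illustration only: it counts character n-grams and compares a text against per-class count profiles using cosine similarity, a deliberately simple stand-in for the classifiers used in published systems. All function names, the n-gram range of 1 to 3, and the placeholder labels <code>L1_A</code> and <code>L1_B</code> are assumptions made for the example, and the training strings are invented.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in a text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(labelled_texts):
    """Sum the n-gram counts of all texts sharing an L1 label."""
    profiles = {}
    for label, text in labelled_texts:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the label whose profile is most similar to the text."""
    features = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(profiles[label], features))


# Invented toy data with placeholder labels, for illustration only.
toy_training = [
    ("L1_A", "aaaa aaaa"),
    ("L1_B", "bbbb bbbb"),
]
toy_profiles = train_profiles(toy_training)
print(predict(toy_profiles, "aaa"))
```

Character n-grams need no tokeniser, parser or tagger, so the same extraction code applies to any L2 without language-specific preprocessing.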
== 2013
The Building Educational Applications (BEA) workshop at [[NAACL]] 2013 hosted the inaugural NLI shared task.<ref>Tetreault et al., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.5931&rep=rep1&type=pdf "A report on the first native language identification shared task"], 2013</ref> The competition drew 29 entries from teams across the globe, 24 of which also published a paper describing their systems and approaches.