{{Short description|Determining someone's first language based on how they write or speak a different language}}
'''Native-language identification''' ('''NLI''') is the task of determining an author's [[first language|native language]] (L1) based only on their writings in a second language ([[Second language|L2]]).<ref>Wong, Sze-Meng Jojo, and Mark Dras. [http://anthology.aclweb.org/D/D11/D11-1148.pdf "Exploiting parse structures for native language identification"]. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.</ref>
== Overview ==
NLI works under the assumption that an author's L1 disposes them towards particular language production patterns in their L2. This relates to cross-linguistic influence (CLI), a key topic in the field of second-language acquisition (SLA) that analyzes transfer effects from the L1 on later learned languages.
Using large-scale English data, NLI methods achieve over 80% accuracy in predicting the native language of texts written by authors from 11 different L1 backgrounds.<ref>Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. [https://aclanthology.org/W17-5007/ "A Report on the 2017 Native Language Identification Shared Task"]. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 62–75, Copenhagen, Denmark. Association for Computational Linguistics.</ref> This compares with a random-choice baseline of about 9% (one in eleven).
==Applications==
===Pedagogy and Language Transfer===
This identification of L1-specific features has been used to study [[language transfer]] effects in second-language acquisition.<ref>Malmasi, Shervin, and Mark Dras. [http://www.aclweb.org/anthology/D/D14/D14-1144.pdf "Language Transfer Hypotheses with Linear SVM Weights"]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.</ref> This is useful for developing pedagogical material, teaching methods and L1-specific instruction, and for generating learner feedback tailored to the writer's mother tongue.

===Forensic Linguistics===
NLI methods can also be applied in [[forensic linguistics]] as a method of authorship profiling, in order to infer the attributes of an author, including their linguistic background.
This is particularly useful in cases where a text, for example an anonymous letter, is the key piece of evidence in an investigation, and clues about the writer's native language can help investigators identify its source.
This has already attracted interest and funding from intelligence agencies.<ref>Ria Perkins. 2014. "Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis". Ph.D. thesis, Aston University.</ref>
== Methodology ==
[[Natural language processing]] methods are used to extract and identify language usage patterns common to speakers of an L1 group. This is done using language learner data, usually drawn from a learner corpus. [[Machine learning]] is then applied to train classifiers, such as [[support vector machine]]s, to predict the L1 of unseen texts.
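The following is a minimal illustrative sketch of such a pipeline using word n-gram features and a linear support vector machine via the scikit-learn library; the texts and L1 labels are placeholders rather than real learner-corpus data, and practical systems use much larger corpora and richer feature sets.
<syntaxhighlight lang="python">
# Illustrative sketch only: surface word n-gram features feed a linear
# SVM that predicts the author's L1. The texts and labels below are
# placeholders, not real learner data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "placeholder learner essay one",
    "placeholder learner essay two",
    "placeholder learner essay three",
    "placeholder learner essay four",
]
labels = ["L1_A", "L1_B", "L1_A", "L1_B"]  # the authors' native languages

classifier = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # surface features
    LinearSVC(),                                           # linear SVM
)
classifier.fit(texts, labels)
print(classifier.predict(["an unseen learner essay"]))
</syntaxhighlight>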
A range of ensemble-based systems have also been applied to the task and have been shown to improve performance over single-classifier systems.<ref>Malmasi, Shervin, Sze-Meng Jojo Wong, and Mark Dras. [http://anthology.aclweb.org/W/W13/W13-1716.pdf "NLI Shared Task 2013: MQ submission"]. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. 2013.</ref><ref>Habic, Vuk, Semenov, Alexander, and Pasiliao, Eduardo. [https://www.sciencedirect.com/science/article/abs/pii/S0950705120305694 "Multitask deep learning for native language identification"]. Knowledge-Based Systems, 2020.</ref>
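A rough sketch of the ensemble idea, again assuming scikit-learn, is shown below: two base classifiers are each trained on a different feature view and combined by plurality vote. It is not a reproduction of any of the cited systems, and the variable names in the usage comments are hypothetical.
<syntaxhighlight lang="python">
# Sketch of an ensemble over two feature "views" (character and word
# n-grams), combined by plurality vote. Not a reproduction of any
# specific published NLI ensemble.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def base_classifier(analyzer, ngram_range):
    """One ensemble member: a TF-IDF feature view plus a linear SVM."""
    return make_pipeline(
        TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range),
        LinearSVC(),
    )

ensemble = VotingClassifier(
    estimators=[
        ("char_ngrams", base_classifier("char", (2, 4))),
        ("word_ngrams", base_classifier("word", (1, 2))),
    ],
    voting="hard",  # each member casts one vote for an L1 label
)

# Hypothetical usage, assuming lists of learner texts and L1 labels:
# ensemble.fit(train_texts, train_labels)
# predictions = ensemble.predict(test_texts)
</syntaxhighlight>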
Various linguistic feature types have been applied for this task. These include syntactic features such as constituent parses, grammatical dependencies and part-of-speech tags.
Surface-level lexical features such as character, word and lemma [[n-gram]]s have also been found to be quite useful for this task. Character n-grams<ref>Radu Tudor Ionescu, Marius Popescu and Aoife Cahill. [http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00256 "String Kernels for Native Language Identification: Insights from Behind the Curtains"]. Computational Linguistics, 2016.</ref><ref>Radu Tudor Ionescu and Marius Popescu. [https://arxiv.org/abs/1707.08349 "Can string kernels pass the test of time in Native Language Identification?"]. In Proceedings of BEA12, 2017.</ref> in particular have been reported to be the single best feature type for the task.
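The short sketch below shows, with scikit-learn, what character n-gram features look like; the sample sentence is arbitrary and chosen only for illustration.
<syntaxhighlight lang="python">
# Character n-grams capture sub-word patterns such as affixes and
# characteristic letter sequences. The sample sentence is arbitrary.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vectorizer.fit(["the weather is very nice"])

# Each feature is a three-character sequence taken from within word
# boundaries, e.g. ' th', 'the', 'her', 'eat'.
print(sorted(vectorizer.get_feature_names_out()))
</syntaxhighlight>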
== 2013 shared task ==
The Building Educational Applications (BEA) workshop at [[NAACL]] 2013 hosted the inaugural NLI shared task.<ref>Tetreault et al. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.5931&rep=rep1&type=pdf "A report on the first native language identification shared task"]. 2013.</ref> The competition resulted in 29 entries from teams across the globe, 24 of which also published a paper describing their systems and approaches.
==See also==
{{div col|
*
*
*
*
*
*
{{div col end}}
==References==
{{reflist}}
[[Category:Natural language processing]]
[[Category:Machine learning]]
[[Category:Applied linguistics]]
[[Category:Bilingualism]]