Native-language identification: Difference between revisions

'''Native-language identification''' (NLI) is the task of determining an author's [[first language|native language]] (L1) based only on their writings in a [[second language]] (L2).<ref>Wong, Sze-Meng Jojo, and Mark Dras. [http://anthology.aclweb.org/D/D11/D11-1148.pdf "Exploiting parse structures for native language identification"]. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.</ref>
NLI works through identifying language-usage patterns that are common to specific L1 groups and then applying this knowledge to predict the native language of previously unseen texts.
This is motivated in part by applications in [[second-language acquisition]], language teaching and [[forensic linguistics]], amongst others.
 
== Overview ==
NLI works under the assumption that an author's L1 will dispose them towards particular language production patterns in their L2, as influenced by their native language. This relates to cross-linguistic influence (CLI), a key topic in the field of second-language acquisition (SLA) that analyzes transfer effects from the L1 on later learned languages.
 
Using large-scale English data, NLI methods achieve over 80% accuracy in predicting the native language of texts written by authors from 11 different L1 backgrounds. This can be compared to a baseline of 9% for choosing randomly.
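
The 9% figure follows from guessing uniformly among 11 classes. A minimal sketch of that arithmetic, assuming the 11 L1 groups are equally represented (the function name is illustrative and not taken from any cited work):

```python
def random_baseline(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


# 11 L1 backgrounds, as in the paragraph above.
print(f"{random_baseline(11):.1%}")  # prints 9.1%
```

If the classes were not balanced, the chance level would depend on the class distribution rather than on 1/11.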
 
==Applications==
 
===Pedagogy and language transfer===
This identification of L1-specific features has been used to study [[language transfer]] effects in second-language acquisition.<ref>Malmasi, Shervin, and Mark Dras. [http://www.aclweb.org/anthology/D/D14/D14-1144.pdf "Language Transfer Hypotheses with Linear SVM Weights."] Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.</ref> This is useful for developing pedagogical material, teaching methods and L1-specific instructions, and for generating learner feedback that is tailored to the learner's native language.
 
===Forensic linguistics===
NLI methods can also be applied in [[forensic linguistics]] as a method of performing authorship profiling in order to infer the attributes of an author, including their linguistic background.
This is particularly useful in situations where a text, e.g. an anonymous letter, is the key piece of evidence in an investigation and clues about the native language of a writer can help investigators in identifying the source.
This has already attracted interest and funding from intelligence agencies.<ref>Ria Perkins. 2014. "Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis". Ph.D. thesis, Aston University.</ref>
== Methodology ==
 
[[Natural language processing]] methods are used to extract and identify language usage patterns common to speakers of an L1-group. This is done using language learner data, usually from a [[learner corpus]]. Next, [[machine learning]] is applied to train classifiers, like [[support vector machine]]s, for predicting the L1 of unseen texts.<ref>Tetreault et al, [http://anthology.aclweb.org/C/C12/C12-1158.pdf "Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification"], In Proc. International Conf. on Computational Linguistics (COLING), 2012</ref>
A range of ensemble-based systems have also been applied to the task and shown to improve performance over single-classifier systems.<ref>Malmasi, Shervin, Sze-Meng Jojo Wong, and Mark Dras. [http://anthology.aclweb.org/W/W13/W13-1716.pdf "NLI Shared Task 2013: MQ submission"]. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. 2013.</ref>
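
The train-then-combine workflow described above can be illustrated with a small sketch. It is not the method of any cited system: a nearest-centroid classifier with cosine similarity stands in for the support vector machine, the ensemble simply takes a plurality vote over three feature views, and the training strings and labels <code>"A"</code> and <code>"B"</code> are made-up placeholders rather than real learner data.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class CentroidClassifier:
    """Nearest-centroid classifier; a simple stand-in for an SVM."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.centroids = {}

    def fit(self, texts, labels):
        # Sum the feature counts of all training texts sharing a label.
        for text, label in zip(texts, labels):
            self.centroids.setdefault(label, Counter()).update(self.featurize(text))
        return self

    def predict(self, text):
        feats = self.featurize(text)
        return max(sorted(self.centroids),
                   key=lambda lab: cosine(feats, self.centroids[lab]))


def plurality_vote(predictions):
    """Return the most frequent label; ties go to the alphabetically first."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(lab for lab, c in counts.items() if c == top)


# Three feature views: words, single characters, character bigrams.
FEATURE_VIEWS = [
    lambda t: Counter(t.lower().split()),
    lambda t: Counter(t.lower()),
    lambda t: Counter(zip(t.lower(), t.lower()[1:])),
]


def train_ensemble(texts, labels):
    """Train one classifier per feature view."""
    return [CentroidClassifier(view).fit(texts, labels) for view in FEATURE_VIEWS]


def predict_ensemble(members, text):
    """Combine the members' predictions by plurality vote."""
    return plurality_vote([m.predict(text) for m in members])


# Placeholder training data (hypothetical, not from a learner corpus).
TRAIN_TEXTS = ["the cat sat on the mat", "dogs run fast in parks"]
TRAIN_LABELS = ["A", "B"]

members = train_ensemble(TRAIN_TEXTS, TRAIN_LABELS)
print(predict_ensemble(members, "dogs run fast"))
```

Each ensemble member sees the same texts through a different feature view, so the members can disagree; voting is one simple way to combine them, and published systems may use other combination schemes.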
 
Surface-level lexical features such as character, word and lemma [[n-gram|n-grams]] have also been found to be quite useful for this task.
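
As an illustration of such surface features, the following sketch counts character and word n-grams using only the Python standard library. The function names and the <code>normalize</code> parameter are assumptions made for this example; lemma n-grams would require passing a lemmatizer as <code>normalize</code>, which is not provided here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n, normalize=str.lower):
    """Count overlapping word n-grams after whitespace tokenization.

    `normalize` is applied to each token; a lemmatizer could be supplied
    here to obtain lemma n-grams (none is included in this sketch).
    """
    tokens = [normalize(tok) for tok in text.split()]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


print(char_ngrams("banana", 2))
print(word_ngrams("I have seen him", 2))
```

The resulting counts can be used directly as sparse feature vectors for a classifier.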
 
== 2013 shared task ==
The Building Educational Applications (BEA) workshop at [[NAACL]] 2013 hosted the inaugural NLI shared task.<ref>Tetreault et al, [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.5931&rep=rep1&type=pdf "A report on the first native language identification shared task"], 2013</ref> The competition resulted in 29 entries from teams across the globe, 24 of which also published a paper describing their systems and approaches.