Language documentation tools and methods

The field of [[language documentation]] in the modern context involves a complex and ever-evolving set of tools and methods, and the study and development of their use - and, especially, identification and promotion of best practices - can be considered a sub-field of [[language documentation]] proper.<ref>{{Cite web|url=https://sites.google.com/site/ldtoolssummit/home|title=LD Tools Summit|website=sites.google.com|access-date=2016-06-02}}</ref> Among these are ethical and recording principles, workflows and methods, hardware tools, and software tools.<ref name=":0">{{Cite book|title=Linguistic Fieldwork - Springer|last=Bowern|first=Claire|doi=10.1057/9780230590168|year = 2008|isbn = 978-0-230-54538-0}}</ref>
 
== Principles and workflows ==
 
=== Ethics ===
Ethical practices in language documentation have been the focus of much recent discussion and debate.<ref>Austin, Peter K. 2010. 'Communities, ethics and rights in language documentation.' In Peter K. Austin, Ed., ''Language Documentation and Description Vol 7''. London, SOAS: 34-54.</ref> The [[Linguistic Society of America]] has prepared an [http://www.linguisticsociety.org/sites/default/files/Ethics_Statement.pdf Ethics Statement], and maintains an [https://lsaethics.wordpress.com/about/ Ethics Discussion Blog] which is primarily focused on ethics in the language documentation context. The [[First Peoples' Cultural Council]] and [[Endangered Languages Project]] have released a [http://fpcc.ca/linguistcode Linguist's Code of Conduct] for engaging in documentation work. The morality of ethics protocols has itself been brought into question by [[George van Driem]].<ref>{{Cite journal|last=van Driem|first=George|date=2016|title=Endangered Language Research and the Moral Depravity of Ethics Protocols|journal=Language Documentation and Conservation 10: 243-252|doi=|pmid=|hdl=10125/24693}}</ref> Most postgraduate programs that involve some form of language documentation and description require researchers to submit their proposed protocols to an internal Institutional Review Board which ensures that research is being conducted ethically. Minimally, participants should be informed of the process and the intended use of the recordings, and give recorded audible or written permission for the audiovisual materials to be used for linguistic investigation by the researcher(s). Many participants will want to be named as consultants, but others will not - this will determine whether the data needs to be anonymized or restricted from public access.
 
=== Data Formats ===
*[https://web.library.yale.edu/digital-initiatives/digitization-standards-and-guidelines/audiovisual Yale University Library] audiovisual guidelines
 
Most current archive standards for [[video]] use MPEG-4 (H.264) as an encoding or storage format, which includes an AAC audio stream (generally of up to 320&nbsp;kbit/s). [[Sound quality|Audio]] archive quality is at least WAV 44.1&nbsp;kHz, 16-bit.
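
As an illustration of the audio minimum stated above, the following is a minimal sketch (not an archive's official validator) that uses Python's standard <code>wave</code> module to report whether a WAV file falls below 44.1&nbsp;kHz or 16-bit. The function name <code>check_wav</code> and the constants are hypothetical names chosen for this example.

```python
import wave

# Minimum audio archive quality named in the text: WAV, 44.1 kHz, 16-bit.
MIN_SAMPLE_RATE_HZ = 44_100
MIN_SAMPLE_WIDTH_BYTES = 2  # 2 bytes per sample = 16-bit


def check_wav(path):
    """Return a list of problems; an empty list means the file meets the minimum.

    Hypothetical helper for illustration only. It inspects the WAV header
    and does not judge the acoustic quality of the recording itself.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width {width * 8}-bit is below 16-bit")
    return problems
```

A check of this kind only confirms the container parameters; a file can pass it and still be unusable if the recording conditions were poor.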
 
=== Principles for recording ===
Documentation of languages is often difficult, and many of the languages that linguists work with are endangered (they may not be spoken in the near future), so it is recommended to record at the highest quality possible given the limitations of the recorder. For video, this means recording at HD resolution (1080p or 720p) or higher when possible, while for audio this means recording minimally in uncompressed PCM WAV at 44.1&nbsp;kHz (44,100 samples per second), 16-bit resolution. Arguably, however, good recording techniques (isolation, microphone selection and usage, using a tripod to minimize blur) are more important than resolution. A clear recording of a speaker telling a folktale (high signal-to-noise ratio) in MP3 format (perhaps made on a phone) is better than an extremely noisy recording in WAV format where all that can be heard are cars going by. To ensure that good recordings can be obtained, linguists should practice with their recording devices as much as possible and compare the results to see which techniques work best.<ref>{{Cite book|title=Phonetic data analysis : an introduction to fieldwork and instrumental techniques|last=Ladefoged|first=Peter|date=2003|publisher=Blackwell Pub|isbn=978-0631232698|location=Malden, MA|pages=|oclc=51818554}}</ref><ref name=":0" /><ref>{{Cite book|last1=Chelliah|first1=Shobhana L.|last2=de Reuse|first2=Willem J.|date=2011|title=Handbook of Descriptive Linguistic Fieldwork|language=en-gb|doi=10.1007/978-90-481-9026-3|isbn=978-90-481-9025-6|s2cid=60322394 }}</ref><ref>{{Cite book|title=Understanding linguistic fieldwork.|author1=Meakins, Felicity |author2=Green, Jennifer |author3=Turpin, Myfany |publisher=|others=|year=2018|isbn=9781351330114|location=London|pages=|oclc=1029352513}}</ref><ref>{{Cite book|date=2011-11-24|editor-last=Thieberger|editor-first=Nicholas|title=The Oxford Handbook of Linguistic Fieldwork|url=http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199571888.001.0001/oxfordhb-9780199571888|language=en-US|doi=10.1093/oxfordhb/9780199571888.001.0001|isbn=9780191744112|publisher=Oxford University Press}}</ref>
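
Recording uncompressed audio has a storage cost that can be estimated in advance when planning fieldwork. The sketch below works through the arithmetic for uncompressed PCM (duration × sample rate × bytes per sample × channels); the function name <code>pcm_bytes</code> is a hypothetical name for this example, and the figure ignores the small WAV header.

```python
def pcm_bytes(duration_s, sample_rate_hz=44_100, bit_depth=16, channels=1):
    """Approximate size in bytes of uncompressed PCM audio (header excluded)."""
    bytes_per_sample = bit_depth // 8
    return duration_s * sample_rate_hz * bytes_per_sample * channels


# One hour of mono audio at the minimum settings named in the text.
one_hour_mono = pcm_bytes(3600)
print(f"{one_hour_mono / 1_000_000:.1f} MB")  # 317.5 MB
```

One hour of stereo audio at the same settings takes twice as much space, about 635&nbsp;MB.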
 
=== Workflows ===
Directional microphones should be used in most cases, in order to isolate a speaker's voice from other potential noise sources. However, omnidirectional microphones may be preferred in situations involving larger numbers of speakers arrayed in a relatively large space. Among directional microphones, [[Cardioid microphone|cardioid]] microphones are suitable for most applications; in some cases, however, a [[hypercardioid]] ("shotgun") microphone may be preferred.
 
Good quality headset microphones are comparatively expensive, but can produce recordings of extremely high quality in controlled situations.<ref>{{Cite journal|last1=Švec|first1=Jan G.|last2=Granqvist|first2=Svante|date=2010-11-01|title=Guidelines for Selecting Microphones for Human Voice Production Research|url=https://ajslp.pubs.asha.org/article.aspx?articleid=1767774|journal=American Journal of Speech-Language Pathology|language=en|volume=19|issue=4|pages=356–368|doi=10.1044/1058-0360(2010/09-0091)|pmid=20601621|issn=1058-0360|url-access=subscription}}</ref> [[Lavalier microphone|Lavalier]] or "lapel" microphones may be used in some situations. Depending on the microphone, however, they can produce recordings that are inferior to those of a headset microphone for phonetic analysis, and they share the headset microphone's limitation of restricting a recording to a single speaker: while other speakers may be audible on the recording, they will be backgrounded relative to the speaker wearing the lavalier microphone.<ref>{{Cite journal|last=Brixen|first=Eddy|date=1996-05-01|title=Spectral Degradation of Speech Captured by Miniature Microphones Mounted on Persons' Heads and Chests|url=http://www.aes.org/e-lib/browse.cfm?elib=7495|journal=Audio Engineering Society Convention 100|language=English|volume=|pages=|via=en}}</ref>
 
Some good quality microphones used for film-making and interviews include the [http://www.rode.com/microphones/video Røde VideoMic shotgun and the Røde lavalier series], [http://www.shure.com/americas/products/microphones/beta/beta-53-headworn-microphone Shure headworn mics] and [http://www.shure.com/americas/search?utf8=%E2%9C%93&keyword=lavalier#keyword=lavalier&category_1=Microphones Shure lavaliers]. Depending on the recorder and microphone, additional [[Audio and video interfaces and connectors|cables]] (XLR, stereo/mono converter or a [https://www.amazon.com/Rode-SC3-3-5mm-TRRS-Adaptor/dp/B00L6C8PNU TRRS to TRS adapter]) will be necessary.
 
=== SayMore ===
[https://software.sil.org/saymore/ SayMore] is a language documentation package developed by [[SIL International]] in [[Dallas]] which primarily focuses on the initial stages in language documentation, and aims for a relatively uncomplicated user experience.
 
The primary functions of SayMore are: (a) audio recording; (b) file import from recording device (video and/or audio); (c) file organization; (d) metadata entry at session and file levels; (e) association of AV files with evidence of informed consent and other supplementary objects (such as photographs); (f) AV file segmentation; (g) transcription/translation; and (h) [https://sites.google.com/site/boldpng/bold BOLD]-style Careful Speech annotation and Oral Translation.
 
SayMore files can be further exported for annotation in [https://software.sil.org/fieldworks/ FLEx], and metadata can be exported in [[Comma-separated values|.csv]] and [[IMDI]] formats for archiving.
 
=== ELAN ===
 
=== Toolbox ===
[https://software.sil.org/toolbox/ Field Linguist's Toolbox] (usually called Toolbox) is a precursor of [https://software.sil.org/fieldworks/ FLEx] and has been one of the most widely used language documentation packages for some decades. Previously known as [https://software.sil.org/shoebox/ Shoebox], Toolbox's primary functions are construction of a lexical database, and interlinearization of texts through interaction with the lexical database. Both lexical database and texts can be exported to a word processing environment, in the case of the lexical database using the Multi-Dictionary Formatter ([https://software.sil.org/shoebox/mdf/ MDF]) conversion tool. It is also possible to use Toolbox as a transcription environment.<ref>{{Cite journal|last=Margetts|first=Andrew|date=2009|title=Using Toolbox with Media Files|journal=Language Documentation & Conservation |volume=3 |issue=1 |pages=51–86|doi=|pmid=|hdl=10125/4426}}</ref> By comparison with ELAN and FLEx, Toolbox has relatively limited functionality, and is felt by some to have an unintuitive design and interface. However, a large number of projects have been carried out in the Shoebox/Toolbox environment over its lifespan, and its user base continues to enjoy its advantages of familiarity, speed, and community support. Toolbox also has the advantage of working directly with human-readable text files that can be opened in any text editor and easily manipulated and archived. Toolbox files can also be easily converted for storage in XML (recommended for archives), for example with open source Python libraries such as [https://github.com/xigt/xigt Xigt], which is intended for computational uses of IGT data.
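
To illustrate why plain-text storage makes conversion straightforward, the following is a minimal sketch that reads backslash-marked fields and writes them out as XML. It uses only the Python standard library rather than Xigt, and it rests on assumptions not stated above: that each field sits on one line beginning with a backslash marker, that <code>\lx</code> starts a new record, and that the markers <code>\ps</code> and <code>\ge</code> hold part of speech and gloss. The sample entries, function names and XML element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the markers (\lx, \ps, \ge) are assumptions.
SAMPLE = r"""
\lx kata
\ps n
\ge word

\lx lari
\ps v
\ge run
"""


def parse_sfm(text, record_marker="lx"):
    """Split backslash-marked text into records of (marker, value) pairs."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # this sketch skips blank and continuation lines
        parts = line[1:].split(None, 1)
        if not parts:
            continue  # a lone backslash carries no marker
        marker = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if marker == record_marker:
            current = []
            records.append(current)
        if current is not None:  # fields before the first record are ignored
            current.append((marker, value))
    return records


def records_to_xml(records):
    """Serialize records as XML, using each marker as an element name.

    Assumes every marker is a valid XML name; ElementTree does not check this.
    """
    root = ET.Element("database")
    for fields in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in fields:
            ET.SubElement(entry, marker).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = records_to_xml(parse_sfm(SAMPLE))
```

A real project would need to handle multi-line fields, file headers and character encoding, which this sketch leaves out.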
 
=== Tools for automating components of the workflow ===
Language documentation may be partially automated thanks to a number of software tools, including:
* [[ESpeakNG|eSpeak]]
* [[HTK (software)|HTK]]
* [[Lingua Libre]], a [[FLOSS|libre]] online tool for recording a large number of words and phrases in a short period (up to 1,000 words per hour with a clean word list and an experienced user). It automates the classic procedure for recording audio and video pronunciation files (for [[Spoken language|spoken]] and [[Sign language|signed]] languages). Once the recording is done, the platform automatically uploads clean, well-cut, well-named and app-friendly files directly to [[c:Category:Lingua_Libre_pronunciation|Wikimedia Commons]] (it is possible to download datasets for a specific language).
* Maus
* Prosodylab Aligner
* Sox
 
== Literature ==
The peer-reviewed journal [http://nflrc.hawaii.edu/ldc/ Language Documentation and Conservation] has published a large number of articles focusing on tools and methods in language documentation.
 
== Film ==
The 2021 Indian documentary film [[Dreaming of Words]] traces the life and work of [[Njattyela Sreedharan]], a fourth standard drop-out, who compiled a multilingual dictionary connecting four major [[Dravidian languages]]: [[Malayalam]], [[Kannada]], [[Tamil language|Tamil]] and [[Telugu language|Telugu]].<ref>{{Cite web|url=https://bookofachievers.com/articles/82-yo-compiles-dictionary-of-4-dravidian-languages-useful-ofcourse|title = 82-year-old Kerala man's Dictionary is in the four Dravidian languages. 25 long years to compile}}</ref><ref>{{Cite web|url=https://www.thebetterindia.com/246205/83-yo-kerala-school-dropout-creates-unique-dictionary-in-4-south-indian-languages-vid01/|title=83-YO Kerala School Dropout Creates Unique Dictionary in 4 South Indian Languages|date=31 December 2020}}</ref><ref>{{Cite news|url=https://www.thehindu.com/news/national/kerala/for-keralites-door-opens-to-three-other-dravidian-languages/article32986464.ece|title = For Keralites, door opens to three other Dravidian languages|newspaper = The Hindu|date = 30 October 2020|last1 = Sajit|first1 = C. p.}}</ref> Travelling across four states and doing extensive research, he spent twenty-five years<ref>{{Cite web|url=https://silvertalkies.com/the-man-who-wrote-a-dictionary-in-four-languages/|title=The Man Who Wrote A Dictionary In Four Languages – Silver Talkies|website=silvertalkies.com}}</ref> making this multilingual dictionary.
 
== See also ==
 
* [https://web.archive.org/web/20181026095442/http://www.resourcebook.eu/ LRE Map], a language resources map searchable by resource type, language(s), language type, modality, resource use, availability, production status, conference(s) and resource name