Language documentation tools and methods: Difference between revisions

No edit summary
re-ordered some headings, added information under Ethics and Workflows headings
Line 1:
The field of [[language documentation]] in the modern context involves a complex and ever-evolving set of tools and methods, and the study and development of their use - and, especially, identification and promotion of best practices - can be considered a sub-field of [[language documentation]] proper.<ref>{{Cite web|url=https://sites.google.com/site/ldtoolssummit/home|title=LD Tools Summit|website=sites.google.com|access-date=2016-06-02}}</ref> Among these are ethical and recording principles, workflows and methods, hardware tools, and software tools.<ref name=":0">{{Cite book|url=https://link.springer.com/10.1057/9780230590168|title=Linguistic Fieldwork - Springer|last=Bowern|first=Claire|doi=10.1057/9780230590168}}</ref>
 
== Principles and workflows ==
Researchers in language documentation often conduct linguistic fieldwork to gather the data on which their work is based, recording audiovisual files that document language use in traditional contexts. Because the environments in which linguistic fieldwork takes place may be logistically challenging, not every type of recording tool is necessary or ideal, and compromises must often be struck between quality, cost and usability. It is also important to envision the remainder of one's complete workflow and intended outcomes; for example, if video files are made, some processing may be required to extract the audio component so that different software packages can work with it.
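As a rough illustration of the audio-extraction step mentioned above, the following sketch builds a command line for the widely used [https://ffmpeg.org/ ffmpeg] tool. This article does not prescribe ffmpeg; it is assumed here only as one common option, and the file names are hypothetical. The function only assembles the command, so it can be inspected before anything is run.

```python
import shlex

def build_extract_audio_command(video_path, wav_path, sample_rate=44100):
    """Return an ffmpeg argument list that saves a video's audio track as 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-i", video_path,          # input video file (hypothetical name below)
        "-vn",                     # ignore the video stream
        "-acodec", "pcm_s16le",    # uncompressed 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        wav_path,
    ]

cmd = build_extract_audio_command("session01.mp4", "session01.wav")
print(shlex.join(cmd))
```

Assuming ffmpeg is installed, the list can be passed to <code>subprocess.run(cmd, check=True)</code>. Note that if the camera recorded compressed audio, converting it to WAV does not restore quality that was lost at recording time.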
 
=== Ethics ===
Ethical practices in language documentation have been the focus of much recent discussion and debate.<ref>Austin, Peter K. 2010. 'Communities, ethics and rights in language documentation.' In Peter K. Austin, Ed., ''Language Documentation and Description Vol 7''. London, SOAS: 34-54.</ref> The [[Linguistic Society of America]] has prepared an [http://www.linguisticsociety.org/sites/default/files/Ethics_Statement.pdf Ethics Statement], and maintains an [https://lsaethics.wordpress.com/about/ Ethics Discussion Blog] which is primarily focused on ethics in the language documentation context. The morality of ethics protocols has itself been brought into question by [[George van Driem]].<ref>{{Cite journal|last=van Driem|first=George|date=2016|title=Endangered Language Research and the Moral Depravity of Ethics Protocols|url= http://hdl.handle.net/10125/24693|journal=Language Documentation and Conservation 10: 243-252|doi=|pmid=|access-date=}}</ref> Most postgraduate programs that involve some form of language documentation and description require researchers to submit their proposed protocols to an internal Institutional Review Board, which ensures that research is being conducted ethically. At a minimum, participants should be informed of the process and the intended use of the recordings, and should give recorded audible or written permission for the audiovisual materials to be used for linguistic investigation by the researcher(s). Many participants will want to be named as consultants, but others will not - this will determine whether the data needs to be anonymized or restricted from public access.
 
=== Data Formats ===
Standards for formats are critical for interoperability between software tools, e.g. [[OLAC]]. Many individual archives or data repositories have their own standards and requirements for data deposited on their servers - knowledge of these requirements ought to inform the data collection strategy and tools used, and should be part of a [[data management plan]] developed before the start of research. Some example guidelines from well-used repositories are given below:
 
* [https://www.soas.ac.uk/elar/helpsheets/ Endangered Languages Archive (ELAR)] guidelines
* [http://www.mpi.nl/corpus/html/lamus/apa.html Max Planck Institute Archive] accepted formats
* [https://web.library.yale.edu/digital-initiatives/digitization-standards-and-guidelines/audiovisual Yale University Library] audiovisual guidelines
 
Most current archive standards for [[video]] use MPEG-4 (H264) as an encoding or storage format, which includes an AAC audio stream of up to 320 kbps. [[Sound quality|Audio]] archive quality is at least WAV 44.1 kHz, 16-bit.
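A recording can be checked against the minimum audio figures above before deposit. The sketch below uses Python's standard <code>wave</code> module; the function names and the silent test files are illustrative assumptions, not part of any archive's tooling, and individual archives may impose further requirements.

```python
import os
import tempfile
import wave

def meets_archive_minimum(path, min_rate=44100, min_bits=16):
    """Return True if a PCM WAV file has at least the given sample rate and bit depth."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    return rate >= min_rate and bits >= min_bits

def write_silence(path, rate, sample_width, seconds=1):
    """Write a mono PCM WAV file of silence, only to have something to check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * sample_width * seconds))

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")  # 44.1 kHz, 16-bit
    low = os.path.join(tmp, "low.wav")    # 22.05 kHz, 16-bit
    write_silence(good, 44100, 2)
    write_silence(low, 22050, 2)
    good_ok = meets_archive_minimum(good)
    low_ok = meets_archive_minimum(low)

print(good_ok, low_ok)
```

The <code>wave</code> module reads only uncompressed PCM files, so a WAV container holding another encoding raises <code>wave.Error</code> rather than returning a result.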
 
=== Principles for recording ===
Because documentation is often difficult, and many of the languages that linguists work with are endangered (they may no longer be spoken in the near future), it is recommended to record at the highest quality the recorder allows. For video, this means recording at HD resolution (1080p or 720p) or higher when possible, while for audio this means recording minimally in uncompressed WAV, 44.1 kHz, 16-bit resolution. Arguably, however, good recording techniques (isolation, microphone selection and usage, using a tripod to minimize blur) are more important than resolution. A clear recording of a speaker telling a folktale (high signal/noise ratio) in MP3 format (such as one made on a phone) is better than an extremely noisy recording in WAV format where all that can be heard are cars going by. To ensure that good recordings can be obtained, linguists should practice with their recording devices as much as possible and compare the results to observe which techniques yield the best results.<ref>{{Cite book|url=https://www.worldcat.org/oclc/51818554|title=Phonetic data analysis : an introduction to fieldwork and instrumental techniques|last=Ladefoged|first=Peter|date=2003|publisher=Blackwell Pub|year=|isbn=0631232699|location=Malden, MA|pages=|oclc=51818554}}</ref><ref name=":0" /><ref>{{Cite journal|last=Chelliah|first=Shobhana L.|last2=de Reuse|first2=Willem J.|date=2011|title=Handbook of Descriptive Linguistic Fieldwork|url=http://link.springer.com/10.1007/978-90-481-9026-3|language=en-gb|doi=10.1007/978-90-481-9026-3}}</ref><ref>{{Cite book|url=https://www.worldcat.org/oclc/1029352513|title=Understanding linguistic fieldwork.|last=Meakins, Felicity; Green, Jennifer; Turpin, Myfany|first=|publisher=|others=|year=2018|isbn=9781351330114|location=London|pages=|oclc=1029352513}}</ref><ref>{{Cite journal|date=2011-11-24|editor-last=Thieberger|editor-first=Nicholas|title=The Oxford Handbook of Linguistic Fieldwork|url=http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199571888.001.0001/oxfordhb-9780199571888|language=en-US|doi=10.1093/oxfordhb/9780199571888.001.0001}}</ref>
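When comparing test recordings, the signal/noise ratio can be put into rough numbers by comparing the level of a stretch of speech with the level of a stretch of background noise from the same recording. The sketch below is a simplified estimate under that assumption (the sample values are invented, and the "speech" stretch of a real recording also contains noise); it uses the standard decibel formula <code>20 * log10(rms_speech / rms_noise)</code>.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech_samples, noise_samples):
    """Rough signal/noise ratio in decibels from a speech stretch and a noise-only stretch."""
    return 20 * math.log10(rms(speech_samples) / rms(noise_samples))

# Invented sample values: speech at amplitude 1000, background noise at amplitude 10.
speech = [1000, -1000] * 100
noise = [10, -10] * 100
ratio = snr_db(speech, noise)
print(round(ratio, 1))
```

A higher figure indicates a cleaner recording, so the same passage recorded with two microphone placements can be compared on this value as well as by ear.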
 
=== Workflows ===
For many linguists the end-result of making recordings is language analysis, often grammatical investigation of a language's structural properties via one of the software tools listed below. This requires transcription of the audio, generally in collaboration with native speakers of the language in question. For general transcription, media files can be played back on the computer and paused for transcription in a text editor. Other (cross-platform) tools to assist this process include [https://www.audacityteam.org/ Audacity] and [[sourceforge:projects/trans/files/transcriber/1.5.1/|Transcriber]], while a program like [https://tla.mpi.nl/tools/tla-tools/elan/ ELAN] (described further below) can also perform this function.
 
Programs like [https://software.sil.org/toolbox/ Toolbox] or [https://software.sil.org/fieldworks/ FLEx] are often preferred by linguists who want to be able to [[Interlinear gloss|interlinearize]] their texts, as these programs build a dictionary of forms and parsing rules to help speed up analysis. Unfortunately, media files are generally not linked by these programs (as they are in ELAN), making it difficult to view or listen back to recordings to check transcriptions. There is [https://github.com/lingdoc/trs2txt currently a workaround] for Toolbox that allows timecodes to reference an audio file and enable playback (of a complete text or a referenced sentence) from within Toolbox - in this workflow, time-alignment of text is performed in Transcriber, and then the relevant timecodes and text are converted into a format that Toolbox can read.
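The conversion step in this workflow can be sketched as follows. This is not the code of the linked workaround; it is a minimal illustration that assumes Transcriber's XML layout (<code>Turn</code> elements containing empty <code>Sync</code> elements, each followed by the transcribed text) and uses hypothetical field markers: <code>\ref</code> and <code>\tx</code> follow common Toolbox practice, while <code>\aud</code> is invented here to hold the file name and timecodes.

```python
import xml.etree.ElementTree as ET

def trs_to_toolbox(trs_xml, audio_name):
    """Convert Transcriber-style XML into backslash-marked records with timecodes."""
    root = ET.fromstring(trs_xml)
    records = []
    for turn in root.iter("Turn"):
        syncs = [el for el in turn if el.tag == "Sync"]
        for i, sync in enumerate(syncs):
            text = (sync.tail or "").strip()  # text following the Sync point
            if not text:
                continue
            start = float(sync.get("time"))
            # A segment ends at the next Sync point, or at the end of the turn.
            if i + 1 < len(syncs):
                end = float(syncs[i + 1].get("time"))
            else:
                end = float(turn.get("endTime"))
            records.append(
                f"\\ref {audio_name}.{len(records) + 1:03d}\n"
                f"\\aud {audio_name} {start:.3f} {end:.3f}\n"  # hypothetical marker
                f"\\tx {text}\n"
            )
    return "\n".join(records)

# Invented sample in the assumed layout.
SAMPLE = """<Trans><Episode><Section type="report" startTime="0" endTime="7.5">
<Turn startTime="0" endTime="7.5">
<Sync time="0"/>
first sentence
<Sync time="3.25"/>
second sentence
</Turn></Section></Episode></Trans>"""

out = trs_to_toolbox(SAMPLE, "story01")
print(out)
```

The marker names would need to match whatever the Toolbox project's settings expect, and real Transcriber files may contain further elements between Sync points that this sketch does not handle.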
 
== Hardware ==
Line 14 ⟶ 31:
The [https://www.zoom.co.jp/products/field-video-recording/video-recording Zoom] series, particularly the [https://www.zoom-na.com/products/field-video-recording/video-recording/zoom-q8/specs Q8], [https://www.zoom.co.jp/products/field-video-recording/video-recording/q4n-handy-video-recorder#specs Q4n], and [https://www.zoom.co.jp/products/field-video-recording/video-recording/q2n-handy-video-recorder#specs Q2n], which record to multiple audio formats, including WAV (44.1/48/96 kHz, 16/24-bit).
 
When using a video recorder that does not record audio in WAV format (such as most DSLR cameras), it is recommended to record audio separately on another recorder, following some of the guidelines below. As with the audio recorders described below, many video recorders also accept microphone input of various kinds (generally through an 1/8-inch or TRS connector) - this can ensure a high-quality backup audio recording that is in sync with the recorded video, which can be helpful in some cases (e.g. for transcription).
 
=== Audio recorders and microphones ===
Line 61 ⟶ 78:
*[[ESpeakNG|eSpeak]]
*[[HTK_(software)|HTK]]
 
== Literature ==