Speech coding: Difference between revisions


Revision as of 22:40, 24 March 2023 edit Dondervogel 2 (talk \| contribs) Extended confirmed users 17,902 edits m →Modern speech compression: low data rate is not a requirement; it's a means to an end ← Previous edit		Latest revision as of 22:11, 17 December 2024 edit undo Kvng (talk \| contribs) Extended confirmed users, New page reviewers 115,948 edits m avoid bit/s wrap at slash
(35 intermediate revisions by 13 users not shown)
{{Short description|Lossy audio compression applied to human speech}}
{{Use American English|date=May 2022}}
{{more citations needed|date=January 2013}}
'''Speech coding''' is an application of [[data compression]] to [[digital audio]] signals containing [[speech]]. Speech coding applies speech-specific [[parameter estimation]], based on [[audio signal processing]] techniques, to model the speech signal, and combines it with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.<ref>{{cite journal |first1=M. |last1=Arjona Ramírez |first2=M. |last2=Minami |title=Low bit rate speech coding |journal=Wiley Encyclopedia of Telecommunications |editor=J. G. Proakis |location=New York |publisher=Wiley |year=2003 |volume=3 |pages=1299–1308}}</ref>

Common applications of speech coding are [[mobile telephony]] and [[voice over IP]] (VoIP).<ref>M. Arjona Ramírez and M. Minami, "Technology and standards for low-bit-rate vocoding methods," in The Handbook of Computer Networks, H. Bidgoli, Ed., New York: Wiley, 2011, vol. 2, pp. 447–467.</ref> The most widely used speech coding technique in mobile telephony is [[linear predictive coding]] (LPC), while the most widely used in VoIP applications are the LPC and [[modified discrete cosine transform]] (MDCT) techniques.{{Citation needed|date=December 2019}}

The techniques employed in speech coding are similar to those used in [[audio data compression]] and [[audio coding]], where knowledge of [[psychoacoustics]] is used to transmit only data that is relevant to the human auditory system. For example, in [[voiceband]] speech coding, only information in the frequency band 400 to 3500 Hz is transmitted, but the reconstructed signal retains adequate [[Intelligibility (communication)|intelligibility]].

Speech coding differs from other forms of audio coding in that speech is a simpler signal than other audio signals, and statistical information is available about the properties of speech. As a result, some auditory information that is relevant in general audio coding can be unnecessary in the speech coding context. Speech coding stresses the preservation of intelligibility and ''pleasantness'' of speech while using a constrained amount of transmitted data.<ref>P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Ed., Amsterdam: Elsevier Science, 1995, pp. 467-494.</ref> In addition, most speech applications require low coding delay, as [[Latency (audio)|latency]] interferes with speech interaction.<ref>J. H. Chen, R. V. Cox, Y.-C. Lin, N. S. Jayant, and M. J. Melchner, A low-delay CELP coder for the CCITT 16 kb/s speech coding standard. IEEE J. Select. Areas Commun. 10(5): 830-849, June 1992.</ref>

== Categories ==
Speech coders are of two classes:<ref>{{cite web |url = http://users.ece.gatech.edu/~juang/8873/Bae-LPC10.ppt |title = Soo Hyun Bae, ECE 8873 Data Compression & Modeling, Georgia Institute of Technology , 2004 |archive-url=https://web.archive.org/web/20060907225836/http://users.ece.gatech.edu/~juang/8873/Bae-LPC10.ppt |archive-date=7 September 2006 |url-status=dead}}</ref>
# Waveform coders
#* Time-domain: [[PCM]], [[ADPCM]]
#* Frequency-domain: [[sub-band coding]], [[ATRAC]]
# [[Vocoder]]s
#* [[Linear predictive coding]] (LPC)
#* [[Formant synthesis|Formant coding]]
#* [[Machine learning]], i.e. [[Deep learning speech synthesis#Neural vocoder|neural vocoder]]<ref>{{cite journal |last1=Zeghidour |first1=Neil |last2=Luebs |first2=Alejandro |last3=Omran |first3=Ahmed |last4=Skoglund |first4=Jan |last5=Tagliasacchi |first5=Marco |title=SoundStream: An End-to-End Neural Audio Codec |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |date=2022 |volume=30 |pages=495–507 |doi=10.1109/TASLP.2021.3129994|arxiv=2107.03312|s2cid=236149944 }}</ref>

== Sample companding viewed as a form of speech coding ==
The [[A-law]] and [[μ-law algorithm]]s used in [[G.711]] [[PCM]] [[digital telephony]] can be seen as an early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 [[audio bit depth|bits of resolution]].<ref>{{cite book |first1=N. S. |last1=Jayant |first2=P. |last2=Noll |title=Digital coding of waveforms |location=Englewood Cliffs |publisher=Prentice-Hall |year=1984}}</ref> Logarithmic companding is consistent with human hearing perception in that a low-amplitude noise is heard alongside a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a [[periodic waveform]] having a single [[fundamental frequency]] with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.{{citation needed|date=July 2023}}{{dubious|discuss=Logarithmic companding for music|date=July 2023}} A wide variety of other algorithms were tried at the time, mostly [[delta modulation]] variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems.
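
The logarithmic companding idea described above can be illustrated with a minimal sketch. It assumes the continuous μ-law curve with μ = 255 followed by a plain uniform 8-bit quantizer; G.711 itself specifies a segmented, bit-exact approximation of this curve, which is not reproduced here. All function names are illustrative and are not taken from any standard or library.

```python
import math

MU = 255.0  # assumed companding parameter (the value commonly quoted for mu-law)

def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law curve, also in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode_8bit(x: float) -> int:
    """Clamp, compand, then quantize uniformly to an 8-bit code in 0..255."""
    y = mulaw_compress(max(-1.0, min(1.0, x)))
    return int(round((y + 1.0) / 2.0 * 255))

def decode_8bit(code: int) -> float:
    """Map an 8-bit code back to an approximate sample value."""
    y = code / 255.0 * 2.0 - 1.0
    return mulaw_expand(y)

if __name__ == "__main__":
    # Compare the round-trip error for a quiet and a loud sample.
    for sample in (0.01, 0.5):
        restored = decode_8bit(encode_8bit(sample))
        print(sample, restored, abs(restored - sample))
```

Because the quantization step grows with amplitude, the round-trip error in this sketch stays roughly proportional to the signal level: quiet samples are reproduced with small absolute error, while the larger error on loud samples is the part that is expected to be masked.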
At the time of their design, their 33% bandwidth reduction at very low complexity made them an excellent engineering compromise. Their audio performance remains acceptable, and there has been no need to replace them in the stationary phone network.{{citation needed|date=July 2023}}

In 2008, the [[G.711.1]] codec, which has a scalable structure, was standardized by ITU-T. The input sampling rate is 16 kHz.<ref name="g711-1-2012">{{citation |publisher=ITU-T |date=2012 |url=http://www.itu.int/rec/T-REC-G.711.1/en |title=G.711.1 : Wideband embedded extension for G.711 pulse code modulation |access-date=2022-12-24}}</ref>

== Modern speech compression ==
Much of the later work in speech compression was motivated by military research into digital communications for [[Secure voice|secure military radios]], where very low data rates were used to achieve effective operation in a hostile radio environment. At the same time, far more [[processing power]] was available, in the form of [[Very Large Scale Integration|VLSI circuits]], than was available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.
These techniques were available through the open research literature to be used for civilian applications, allowing the creation of digital [[mobile phone network]]s with substantially higher channel capacities than the analog systems that preceded them.{{Citation needed|date=December 2019}}

The most widely used speech coding algorithms are based on [[linear predictive coding]] (LPC).<ref>{{cite journal |last1=Gupta |first1=Shipra |title=Application of MFCC in Text Independent Speaker Recognition |journal=International Journal of Advanced Research in Computer Science and Software Engineering |date=May 2016 |volume=6 |issue=5 |pages=805–810 (806) |s2cid=212485331 |issn=2277-128X |url=https://pdfs.semanticscholar.org/2aa9/c2971342e8b0b1a0714938f39c406f258477.pdf |archive-url=https://web.archive.org/web/20191018231621/https://pdfs.semanticscholar.org/2aa9/c2971342e8b0b1a0714938f39c406f258477.pdf |url-status=dead |archive-date=2019-10-18 |access-date=18 October 2019}}</ref> In particular, the most common speech coding scheme is the LPC-based [[code-excited linear prediction]] (CELP) coding, which is used, for example, in the [[GSM]] standard. In CELP, the modeling is divided into two stages: a [[linear prediction|linear predictive]] stage that models the spectral envelope, and a codebook-based model of the residual of the linear predictive model. The linear prediction coefficients are computed and quantized, usually as [[line spectral pairs]] (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use [[channel coding]] for transmission, to avoid losses due to transmission errors. To get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.
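
The linear predictive stage described above can be sketched with the autocorrelation method and the Levinson–Durbin recursion. This is a sketch under simplifying assumptions, not the procedure of any particular codec: the helper names are hypothetical, and windowing, conversion to line spectral pairs, quantization, and the codebook search over the residual are all omitted.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, err), where the predictor is
    x_hat[n] = sum(a[k] * x[n - 1 - k] for k in range(order))
    and err is the remaining prediction-error energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        updated = a[:]
        updated[i] = k
        for j in range(i):
            updated[j] = a[j] - k * a[i - 1 - j]
        a = updated
        err *= 1.0 - k * k
    return a, err

def lpc_residual(frame, a):
    """Prediction residual; samples before the frame start are taken as zero."""
    residual = []
    for n, x in enumerate(frame):
        pred = sum(a[k] * frame[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(x - pred)
    return residual

if __name__ == "__main__":
    # A pure tone is almost perfectly predictable with two coefficients.
    frame = [math.sin(0.3 * n) for n in range(200)]
    coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
    residual = lpc_residual(frame, coeffs)
    print(coeffs, err, sum(e * e for e in residual))
```

In a CELP-style coder, the coefficients would describe the spectral envelope and the low-energy residual would be what the codebook stage has to represent; here the residual is only computed, not coded.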

The [[modified discrete cosine transform]] (MDCT) is used in the LD-MDCT technique employed by the [[AAC-LD]] format, introduced in 1999.<ref name="Schnell">{{cite conference |last1=Schnell|first1=Markus |last2=Schmidt |first2=Markus |last3=Jander |first3=Manuel |last4=Albert |first4=Tobias |last5=Geiger |first5=Ralf |last6=Ruoppila |first6=Vesa |last7=Ekstrand |first7=Per |last8=Grill |first8=Bernhard |date=October 2008 |title=MPEG-4 Enhanced Low Delay AAC - A New Standard for High Quality Communication |url=https://www.iis.fraunhofer.de/content/dam/iis/de/doc/ame/conference/AES-125-Convention_AAC-ELD-NewStandardForHighQualityCommunication_AES7503.pdf |conference=125th AES Convention |publisher=[[Audio Engineering Society]] |access-date=20 October 2019 |website=[[Fraunhofer IIS]]}}</ref> MDCT has since been widely adopted in [[voice-over-IP]] (VoIP) applications, such as the [[G.729.1]] [[wideband audio]] codec introduced in 2006,<ref name="Nagireddi">{{cite book |last1=Nagireddi |first1=Sivannarayana |title=VoIP Voice and Fax Signal Processing |date=2008 |publisher=[[John Wiley & Sons]] |isbn=9780470377864 |page=69 |url=https://books.google.com/books?id=5AneeZFE71MC&pg=PA69}}</ref> [[Apple Inc.|Apple]]'s [[FaceTime]] (using AAC-LD) introduced in 2010,<ref name="AppleInsider standards 1">{{cite web|url=http://www.appleinsider.com/articles/10/06/08/inside_iphone_4_facetime_video_calling.html|date=June 8, 2010|access-date=June 9, 2010|title=Inside iPhone 4: FaceTime video calling|publisher=[[AppleInsider]]|author=Daniel Eran Dilger}}</ref> and the [[CELT]] codec introduced in 2011.<ref name="presentation">[http://people.xiph.org/~greg/video/linux_conf_au_CELT_2.ogv Presentation of the CELT codec]
{{Webarchive|url=https://web.archive.org/web/20110807182250/http://people.xiph.org/~greg/video/linux_conf_au_CELT_2.ogv |date=2011-08-07 }} by Timothy B. Terriberry (65 minutes of video, see also [http://www.celt-codec.org/presentations/misc/lca-celt.pdf presentation slides] in PDF)</ref>

[[Opus (audio format)|Opus]] is a [[free software]] audio coder. It combines the speech-oriented LPC-based [[SILK]] algorithm and the lower-latency MDCT-based CELT algorithm, switching between or combining them as needed for maximal efficiency.<ref name="homepage">{{cite web |url = https://opus-codec.org/ |title=Opus Codec |work=Opus |publisher=Xiph.org Foundation |type=Home page |access-date=July 31, 2012 }}</ref><ref>{{cite conference |last1=Valin |first1=Jean-Marc |last2=Maxwell |first2=Gregory |last3=Terriberry |first3=Timothy B. |last4=Vos |first4=Koen |title=High-Quality, Low-Delay Music Coding in the Opus Codec |conference=135th AES Convention |publisher=[[Audio Engineering Society]] |date=October 2013 |arxiv=1602.04845 }}</ref> It is widely used for VoIP calls in [[WhatsApp]].<ref name="Register">{{cite news |last1=Leyden |first1=John |title=WhatsApp laid bare: Info-sucking app's innards probed |url=https://www.theregister.co.uk/2015/10/27/whatsapp_forensic_analysis/ |access-date=19 October 2019 |work=[[The Register]] |date=27 October 2015}}</ref><ref name="Hazra">{{cite book |last1=Hazra |first1=Sudip |last2=Mateti |first2=Prabhaker |chapter=Challenges in Android Forensics |editor-last1=Thampi |editor-first1=Sabu M. |editor-last2=Pérez |editor-first2=Gregorio Martínez |editor-last3=Westphall |editor-first3=Carlos Becker |editor-last4=Hu |editor-first4=Jiankun |editor-last5=Fan |editor-first5=Chun I.
|editor-last6=Mármol |editor-first6=Félix Gómez |title=Security in Computing and Communications: 5th International Symposium, SSCC 2017 |date=September 13–16, 2017 |publisher=Springer |isbn=9789811068980 |pages=286–299 (290) |doi=10.1007/978-981-10-6898-0_24 |chapter-url=https://books.google.com/books?id=1u09DwAAQBAJ&pg=PA290}}</ref><ref name="Srivastava">{{cite book |last1=Srivastava |first1=Saurabh Ranjan |last2=Dube |first2=Sachin |last3=Shrivastaya |first3=Gulshan |last4=Sharma |first4=Kavita |chapter=Smartphone Triggered Security Challenges: Issues, Case Studies and Prevention |editor-last1=Le |editor-first1=Dac-Nhuong |editor-last2=Kumar |editor-first2=Raghvendra |editor-last3=Mishra |editor-first3=Brojo Kishore |editor-last4=Chatterjee |editor-first4=Jyotir Moy |editor-last5=Khari |editor-first5=Manju |title=Cyber Security in Parallel and Distributed Computing: Concepts, Techniques, Applications and Case Studies |date=2019 |publisher=John Wiley & Sons |isbn=9781119488057 |pages=187–206 (200) |doi=10.1002/9781119488330.ch12 |s2cid=214034702 |chapter-url=https://books.google.com/books?id=FzGtDwAAQBAJ&pg=PA200}}</ref> The [[PlayStation 4]] video game console also uses Opus for its [[PlayStation Network]] system party chat.<ref name="playstation">{{cite web|url=https://doc.dl.playstation.net/doc/ps4-oss/ |title=Open Source Software used in PlayStation4 |publisher=Sony Interactive Entertainment Inc. |access-date=2017-12-11}}{{failed verification|reason=Source does not indicate how Opus is used|date=September 2022}}</ref>

A number of codecs with even lower [[bit rate]]s have been demonstrated.
[[Codec2]], which operates at bit rates as low as {{nowrap|450 bit/s}}, sees use in amateur radio.<ref>{{cite web |title=GitHub - Codec2 |website=[[GitHub]] |date=November 2019 |url=https://github.com/x893/codec2}}</ref> NATO currently uses [[MELPe]], offering intelligible speech at {{nowrap|600 bit/s}} and below.<ref>Alan McCree, “A scalable phonetic vocoder framework using joint predictive vector quantization of MELP parameters,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. I 705–708, Toulouse, France</ref> Neural vocoder approaches have also emerged: [[Lyra (codec)|Lyra]] by Google achieves "almost eerie" quality at {{nowrap|3 kbit/s}}.<ref name=":0">{{Cite web |last=Buckley |first=Ian |date=2021-04-08 |title=Google Makes Its Lyra Low Bitrate Speech Codec Public |url=https://www.makeuseof.com/google-lyra-speech-codec-public/ |access-date=2022-07-21 |website=MakeUseOf |language=en-US}}</ref> Microsoft's [[Satin (codec)|Satin]] also uses machine learning, but operates at a higher, tunable bit rate and is wideband.<ref name=":3">{{Cite web |last=Levent-Levi |first=Tsahi |date=2021-04-19 |title=Lyra, Satin and the future of voice codecs in WebRTC |url=https://bloggeek.me/lyra-satin-webrtc-voice-codecs/ |access-date=2022-07-21 |website=BlogGeek.me |language=en-US}}</ref>

===Sub-fields===
** [[AMR-WB]] for [[WCDMA]] networks
** [[VMR-WB]] for [[CDMA2000]] networks
** [[Speex]], IP-MR, [[SILK]] (part of [[Opus (audio format)|Opus]]), and [[Unified Speech and Audio Coding|USAC/xHE-AAC]] for VoIP and [[videoconferencing]]
* [[Modified discrete cosine transform]] (MDCT)
** [[AAC-LD]], [[G.722.1]], [[G.729.1]], [[CELT]] and [[Opus (audio format)|Opus]] for VoIP and videoconferencing
* [[Adaptive differential pulse-code modulation]] (ADPCM)
** [[G.722]] for VoIP
* Neural speech coding
** [[Lyra (codec)|Lyra]] (Google): V1 uses neural network reconstruction of log-mel spectrogram; V2 is an end-to-end [[autoencoder]].
** [[Satin (codec)|Satin]] (Microsoft)
** LPCNet (Mozilla, Xiph): neural network reconstruction of LPC features<ref>{{cite web |title=LPCNet: Efficient neural speech synthesis |url=https://github.com/xiph/LPCNet |publisher=Xiph.Org Foundation |date=8 August 2023}}</ref>
; [[Narrowband]] audio coding
** [[FNBDT]] for military applications
** [[Selectable Mode Vocoder|SMV]] for [[CDMA]] networks
** [[Full Rate]], [[Half Rate]], [[Enhanced full rate|EFR]] and [[Adaptive Multi-Rate audio codec|AMR]] for [[GSM]] networks
** [[G.723.1]], [[G.728]], [[G.729]], [[G.729.1]] and [[iLBC]] for VoIP or videoconferencing
* ADPCM
** [[G.726]] for VoIP
* [[Multi-Band Excitation]] (MBE)
** [[Multi-Band Excitation|AMBE+]] for [[digital radio|digital]] [[mobile radio]] and [[satellite phone]]
** [[Codec 2]]

[[Category:Speech codecs| ]]
[[Category:Data compression]]