{{Short description|Lossy audio compression applied to human speech}}
{{Use American English|date=May 2022}}
{{more citations needed|date=January 2013}}
'''Speech coding''' is an application of [[data compression]] to digital audio signals containing [[speech]]. Speech coding combines speech-specific parameter estimation, using [[audio signal processing]] techniques to model the speech signal, with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.
The techniques employed in speech coding are similar to those used in [[audio data compression]] and [[audio coding]], where knowledge of [[psychoacoustics]] is used to transmit only data that is relevant to the human auditory system.
Speech coding differs from other forms of audio coding in that speech is a simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information that is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and ''pleasantness'' of speech, with a constrained amount of transmitted data.<ref>P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Ed., Amsterdam: Elsevier Science, 1995, pp. 467-494.</ref> In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.<ref>J. H. Chen, R. V. Cox, Y.-C. Lin, N. S. Jayant, and M. J. Melchner, A low-delay CELP coder for the CCITT 16 kb/s speech coding standard. IEEE J. Select. Areas Commun. 10(5): 830-849, June 1992.</ref>
== Categories ==
Speech coders are of two classes:
# Waveform coders
#* Time-___domain: [[PCM]], [[ADPCM]]
#* Frequency-___domain: [[sub-band coding]], [[transform coding]]
# [[Vocoder]]s
#* [[Linear predictive coding]] (LPC)
#* [[Formant synthesis|Formant coding]]
#* [[Machine learning]], i.e. [[Deep learning speech synthesis#Neural vocoder|neural vocoder]]<ref>{{cite journal |last1=Zeghidour |first1=Neil |last2=Luebs |first2=Alejandro |last3=Omran |first3=Ahmed |last4=Skoglund |first4=Jan |last5=Tagliasacchi |first5=Marco |title=SoundStream: An End-to-End Neural Audio Codec |journal=IEEE/ACM Transactions on Audio, Speech, and Language Processing |date=2022 |volume=30 |pages=495–507 |doi=10.1109/TASLP.2021.3129994|arxiv=2107.03312|s2cid=236149944 }}</ref>
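To illustrate the waveform-coder family above, the following is a minimal, hypothetical sketch of differential coding with an adaptive quantizer step, in the spirit of ADPCM. The 4-bit code range and step-adaptation constants are illustrative only and do not follow any standardized scheme such as IMA ADPCM or G.726:

```python
import math

# Toy differential coder: transmit a 4-bit code for the difference between
# each sample and a prediction (here, simply the previous reconstructed
# sample), adapting the quantizer step size to the signal level.

def adapt(step, code):
    """Grow the step when the quantizer saturates, shrink it otherwise."""
    return min(step * 1.25, 1.0) if abs(code) >= 6 else max(step * 0.9, 1e-4)

def encode(samples, step=0.05):
    codes, pred = [], 0.0
    for s in samples:
        code = max(-8, min(7, round((s - pred) / step)))  # 4-bit code
        codes.append(code)
        pred += code * step      # track the decoder's reconstruction
        step = adapt(step, code)
    return codes

def decode(codes, step=0.05):
    out, pred = [], 0.0
    for code in codes:
        pred += code * step
        out.append(pred)
        step = adapt(step, code)
    return out

# A slowly varying tone is tracked closely at 4 bits per sample.
x = [math.sin(2 * math.pi * i / 100) for i in range(500)]
max_err = max(abs(a - b) for a, b in zip(x, decode(encode(x))))
print(f"4 bits/sample, max reconstruction error: {max_err:.3f}")
```

Because the encoder updates its predictor from the quantized difference rather than the true sample, encoder and decoder stay in lockstep and quantization errors do not accumulate.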
== Sample companding viewed as a form of speech coding ==
The [[A-law algorithm|A-law]] and [[μ-law algorithm|μ-law]] algorithms used in [[G.711]] PCM digital telephony can be seen as an earlier precursor of speech encoding, requiring only eight bits per sample but giving effectively 13 bits of resolution. The logarithmic companding laws are consistent with human hearing perception in that low-amplitude noise is heard alongside a low-amplitude speech signal but is masked by a high-amplitude one.
A wide variety of other algorithms were tried at the time, mostly [[delta modulation]] variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.{{citation needed|date=July 2023}}
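The companding idea can be sketched with the continuous μ-law curve, F(x) = sgn(x)·ln(1 + μ|x|)/ln(1 + μ) with μ = 255. (Actual G.711 transmission uses a piecewise-linear 8-bit approximation of this curve, not the continuous formula shown here.)

```python
import math

MU = 255  # μ-law parameter used in North American and Japanese telephony

def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse mapping: recover the linear sample from the companded value."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Quiet samples are pushed toward full scale before uniform 8-bit
# quantization, so they receive proportionally more resolution.
for x in (0.01, 0.1, 0.9):
    y = mu_law_compress(x)
    print(f"{x:4.2f} -> companded {y:.3f} -> expanded {mu_law_expand(y):.3f}")
```

The round trip is exact up to floating-point error; in a real codec the companded value would be uniformly quantized to 8 bits between the two steps, which is where the bit-rate saving comes from.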
In 2008, the [[G.711.1]] codec, which has a scalable structure, was standardized by the ITU-T. The input sampling rate is 16 kHz.<ref name="g711-1-2012">{{citation |publisher=ITU-T |date=2012 |url=http://www.itu.int/rec/T-REC-G.711.1/en |title=G.711.1 : Wideband embedded extension for G.711 pulse code modulation |access-date=2022-12-24}}</ref>
== Modern speech compression ==
Much of the later work in speech compression was motivated by military research into digital communications for [[Secure voice|secure military radios]], where very low data rates were required to achieve effective operation in a hostile radio environment.
The most widely used speech coding algorithms are based on [[linear predictive coding]] (LPC).<ref>{{cite journal |last1=Gupta |first1=Shipra |title=Application of MFCC in Text Independent Speaker Recognition |journal=International Journal of Advanced Research in Computer Science and Software Engineering |date=May 2016 |volume=6 |issue=5 |pages=805–810 (806) |s2cid=212485331 |issn=2277-128X |url=https://pdfs.semanticscholar.org/2aa9/c2971342e8b0b1a0714938f39c406f258477.pdf |archive-url=https://web.archive.org/web/20191018231621/https://pdfs.semanticscholar.org/2aa9/c2971342e8b0b1a0714938f39c406f258477.pdf |url-status=dead |archive-date=2019-10-18 |access-date=18 October 2019}}</ref> In particular, the most common speech coding scheme is the LPC-based [[code-excited linear prediction]] (CELP) coding, which is used for example in the [[GSM]] standard. In CELP, the modeling is divided into two stages: a [[linear prediction|linear predictive]] stage that models the spectral envelope, and a code-book-based model of the residual of the linear predictive model. The linear prediction coefficients are computed and quantized, usually as [[line spectral pairs]] (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use [[channel coding]] for transmission, to avoid losses due to transmission errors. To get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.
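The linear predictive stage can be illustrated numerically. The sketch below fits short-term predictor coefficients by generic least squares on a synthetic voiced-like tone; real codecs instead use the autocorrelation method with Levinson–Durbin recursion, and the signal here is invented for the demonstration:

```python
import numpy as np

# Short-term linear prediction, the core of LPC/CELP: each sample is
# approximated as a weighted sum of the previous p samples, so only the
# (much smaller) prediction residual needs to be coded.

rng = np.random.default_rng(0)
n, p = 800, 2
t = np.arange(n)
x = np.sin(2 * np.pi * 0.03 * t) + 0.01 * rng.standard_normal(n)

# Least-squares fit of x[k] ≈ a1*x[k-1] + ... + ap*x[k-p].
A = np.column_stack([x[p - i - 1 : n - i - 1] for i in range(p)])
b = x[p:]
a, *_ = np.linalg.lstsq(A, b, rcond=None)

residual = b - A @ a
gain = np.sum(b**2) / np.sum(residual**2)  # prediction gain
print(f"coefficients: {a.round(3)}, prediction gain: {gain:.0f}x")
```

The large prediction gain shows why coding the residual (plus a handful of quantized coefficients per frame) is far cheaper than coding the waveform directly; CELP then represents that residual with a codebook search.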
The [[modified discrete cosine transform]] (MDCT), a lapped transform based on the [[discrete cosine transform]], is used in several codecs designed for VoIP and videoconferencing, including [[AAC-LD]], [[G.722.1]], [[G.729.1]], [[CELT]] and [[Opus (audio format)|Opus]].
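The MDCT's defining property, time-___domain aliasing cancellation, can be checked numerically. This is a compact sketch assuming the textbook MDCT definition with a sine (Princen–Bradley) window and one common normalization convention; real codecs quantize the coefficients between analysis and synthesis:

```python
import numpy as np

N = 64  # hop size; frames are 2N samples long with 50% overlap

n = np.arange(2 * N)
window = np.sin(np.pi / (2 * N) * (n + 0.5))  # sine (Princen–Bradley) window
# Cosine basis: C[k, n] = cos(pi/N * (n + 1/2 + N/2) * (k + 1/2)), shape N x 2N
C = np.cos(np.pi / N * np.outer(np.arange(N) + 0.5, n + 0.5 + N / 2))

def mdct(frame):
    return C @ (window * frame)                # 2N samples -> N coefficients

def imdct(coeffs):
    return window * (C.T @ coeffs) * (2 / N)   # N coefficients -> 2N samples

rng = np.random.default_rng(1)
x = rng.standard_normal(4 * N)

# Analyze overlapping frames, then overlap-add the inverse transforms:
# the aliasing introduced in each frame cancels between neighbours (TDAC).
y = np.zeros_like(x)
for start in range(0, len(x) - 2 * N + 1, N):
    y[start : start + 2 * N] += imdct(mdct(x[start : start + 2 * N]))

# The fully overlapped middle region is reconstructed exactly.
err = np.max(np.abs(y[N:-N] - x[N:-N]))
print(f"max reconstruction error: {err:.2e}")
```

Despite the 50% frame overlap, only N coefficients are produced per N-sample hop, so the lapped structure adds no coefficient overhead over a non-overlapping transform.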
[[Opus (audio format)|Opus]] is a [[free software]] audio codec that combines the speech-oriented, LPC-based [[SILK]] algorithm with the lower-latency, MDCT-based [[CELT]] algorithm, switching between or combining the two depending on content and bitrate. It is widely used for [[VoIP]] calls and videoconferencing.
A number of codecs with even lower [[bit rate]]s have been demonstrated. [[Codec2]], which operates at bit rates as low as {{nowrap|450 bit/s}}, sees use in amateur radio.<ref>{{cite web |title=GitHub - Codec2 |website=[[GitHub]] |date=November 2019 |url=https://github.com/x893/codec2}}</ref> NATO currently uses [[MELPe]], offering intelligible speech at {{nowrap|600 bit/s}} and below.<ref>Alan McCree, “A scalable phonetic vocoder framework using joint predictive vector quantization of MELP parameters,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. I 705–708, Toulouse, France</ref> Neural vocoder approaches have also emerged: [[Lyra (codec)|Lyra]] by Google gives an "almost eerie" quality at {{nowrap|3 kbit/s}}.<ref name=":0">{{Cite web |last=Buckley |first=Ian |date=2021-04-08 |title=Google Makes Its Lyra Low Bitrate Speech Codec Public |url=https://www.makeuseof.com/google-lyra-speech-codec-public/ |access-date=2022-07-21 |website=MakeUseOf |language=en-US}}</ref> Microsoft's [[Satin (codec)|Satin]] also uses machine learning, but uses a higher tunable bitrate and is wideband.<ref name=":3">{{Cite web |last=Levent-Levi |first=Tsahi |date=2021-04-19 |title=Lyra, Satin and the future of voice codecs in WebRTC |url=https://bloggeek.me/lyra-satin-webrtc-voice-codecs/ |access-date=2022-07-21 |website=BlogGeek.me |language=en-US}}</ref>
===Sub-fields===
** [[AMR-WB]] for [[WCDMA]] networks
** [[VMR-WB]] for [[CDMA2000]] networks
** [[Speex]], IP-MR, [[SILK]]
* [[Modified discrete cosine transform]] (MDCT)
** [[AAC-LD]], [[G.722.1]], [[G.729.1]], [[CELT]] and [[Opus (audio format)|Opus]] for VoIP and videoconferencing
* [[Adaptive differential pulse-code modulation]] (ADPCM)
** [[G.722]] for VoIP
* Neural speech coding
** [[Lyra (codec)|Lyra]] (Google): V1 uses neural network reconstruction of log-mel spectrogram; V2 is an end-to-end [[autoencoder]].
** [[Satin (codec)|Satin]] (Microsoft)
** LPCNet (Mozilla, Xiph): neural network reconstruction of LPC features<ref>{{cite web |title=LPCNet: Efficient neural speech synthesis |url=https://github.com/xiph/LPCNet |publisher=Xiph.Org Foundation |date=8 August 2023}}</ref>
; [[Narrowband]] audio coding
** [[FNBDT]] for military applications
** [[Selectable Mode Vocoder|SMV]] for [[CDMA]] networks
** [[Full Rate]], [[Half Rate]], [[Enhanced Full Rate]] and [[Adaptive Multi-Rate audio codec|AMR]] for [[GSM]] networks
** [[G.723.1]], [[G.728]], [[G.729]], [[G.729.1]] and [[iLBC]] for VoIP or videoconferencing
* ADPCM
** [[G.726]] for VoIP
* [[Multi-Band Excitation]] (MBE)
** [[Multi-Band Excitation|AMBE+]] for [[digital radio|digital]] [[mobile radio]] and [[satellite phone]]
** [[Codec 2]]
== See also ==
==External links==
* [http://www.itu.int/net/itu-t/sigdb/genaudio/Pseries.htm ITU-T Test Signals for Telecommunication Systems Test Samples]
* [http://www.itu.int/rec/T-REC-P.862/ ITU-T Perceptual evaluation of speech quality (PESQ) tool Sources]
[[Category:Speech codecs| ]]
[[Category:Data compression]]