{{short description|Discipline in computing}}
{{cleanup HTML|date=February 2019}}
[[Image:Amazon Echo Plus 02.jpg|thumb|The [[Amazon Echo]], an example of a voice computer]]
'''Voice computing''' is the discipline that develops hardware or software to process voice inputs.<ref>Schwoebel, J. (2018). An Introduction to Voice Computing in Python. Boston; Seattle, Atlanta: NeuroLex Laboratories. https://neurolex.ai/voicebook</ref>
 
==History==
Voice computing has a rich history.<ref>{{Cite web |last=Boyd |first=Clark |date=2019-08-30 |title=Speech Recognition Technology: The Past, Present, and Future |url=https://medium.com/swlh/the-past-present-and-future-of-speech-recognition-technology-cf13c179aaf |access-date=2025-01-10 |website=The Startup |language=en}}</ref> First, scientists like [[Wolfgang Kempelen]] started to build speech machines to produce the earliest synthetic speech sounds. This led to further work by Thomas Edison to record audio with [[dictation machines]] and play it back in corporate settings. In the 1950s and 1960s, there were primitive attempts by [[Bell Labs]], [[IBM]], and others to build automated [[speech recognition]] systems. However, it was not until the 1980s, when [[hidden Markov model]]s were used to recognize up to 1,000 words, that speech recognition systems became relevant.
 
{| class="wikitable"
Line 49:
 
==Hardware==
A '''voice computer''' is assembled hardware and software used to process voice inputs.
 
Voice computers do not necessarily need a screen, such as the traditional [[Amazon Echo]]. In other embodiments, traditional [[laptop computers]] or [[mobile phones]] can be used as voice computers. Moreover, with the advent of [[Internet of things|IoT]]-enabled devices, there are increasingly more interfaces for voice computers, such as within cars or televisions.
 
As of September 2018, there were over 20,000 types of devices compatible with Amazon Alexa.<ref>{{Cite web |last=Kinsella |first=Bret |date=2018-09-02 |title=Amazon Alexa Now Has 50,000 Skills Worldwide, works with 20,000 Devices, Used by 3,500 Brands |url=https://voicebot.ai/2018/09/02/amazon-alexa-now-has-50000-skills-worldwide-is-on-20000-devices-used-by-3500-brands/ |access-date=2025-01-10 |website=Voicebot.ai |language=en-US}}</ref>
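As an illustration (not drawn from any specific product), the following minimal Python sketch shows how an ordinary laptop could serve as a rudimentary voice computer by capturing microphone audio for later processing. It assumes the third-party packages <code>sounddevice</code> and <code>scipy</code> are installed; the sample rate, duration, and output file name are arbitrary choices.

<syntaxhighlight lang="python">
# Minimal sketch: capture microphone audio on a laptop and save it as a WAV file.
# Assumes the third-party packages "sounddevice" and "scipy" are installed.
import sounddevice as sd
from scipy.io import wavfile

SAMPLE_RATE = 16000  # 16 kHz mono is a common rate for speech processing
DURATION = 5         # seconds of audio to record

# Record from the default microphone into a NumPy array of 16-bit samples.
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="int16")
sd.wait()  # block until the recording finishes

# Save the recording so downstream tools (transcription, featurization) can use it.
wavfile.write("recording.wav", SAMPLE_RATE, audio)
</syntaxhighlight>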
 
==Software==
'''Voice computing software''' can read/write, record, clean, encrypt/decrypt, play back, transcode, transcribe, compress, publish, featurize, model, and visualize voice files.
 
Popular software packages related to voice computing include:
* [[FFmpeg]] – for [[transcoding]] audio files from one format to another (e.g. .WAV to .MP3).<ref>FFmpeg. https://www.ffmpeg.org/</ref>
* [[Audacity (audio editor)|Audacity]] – for recording and filtering audio.<ref>Audacity. https://www.audacityteam.org/</ref>
* [[SoX]] – for manipulating audio files and removing environmental noise.<ref>SoX. https://sox.sourceforge.net/</ref>
* [[Natural Language Toolkit]] – for featurizing transcripts with things like [[parts of speech]].<ref>NLTK. https://www.nltk.org/</ref>
* LibROSA – for visualizing audio file spectrograms and featurizing audio files.<ref>LibROSA. https://librosa.github.io/librosa/</ref>
* AudioFlux – for audio and music analysis and feature extraction.<ref>AudioFlux. https://github.com/libAudioFlux/audioFlux/</ref>
* [[OpenSMILE]] – for featurizing audio files with things like mel-frequency cepstrum coefficients.<ref>OpenSMILE. https://www.audeering.com/technology/opensmile/</ref>
* [[CMU Sphinx|PocketSphinx]] – for transcribing speech files into text.<ref>{{Cite web |url=https://github.com/cmusphinx/pocketsphinx |title=PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop: Cmusphinx/Pocketsphinx |website=[[GitHub]] |date=29 March 2020}}</ref>
* Pyttsx3 – for playing back audio files (text-to-speech).<ref>Pyttsx3. https://github.com/nateshmbhat/pyttsx3</ref>
* Pycryptodome – for encrypting and decrypting audio files.<ref>Pycryptodome. https://pycryptodome.readthedocs.io/en/latest/</ref>
* Alexa Voice Service (AVS) – for accessing cloud-based Alexa capabilities through AVS APIs, hardware kits, software tools, and documentation.<ref>Alexa Voice Service. https://developer.amazon.com/alexa-voice-service</ref>
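As a hedged example of how a few of the packages above fit together, the sketch below transcodes a WAV file to MP3 by calling the FFmpeg command-line tool and then computes mel-frequency cepstral coefficients (MFCCs) with LibROSA. It assumes <code>ffmpeg</code> is on the system path and <code>librosa</code> is installed; the file names are placeholders.

<syntaxhighlight lang="python">
# Sketch of two common voice computing steps: transcoding with FFmpeg
# and featurization with LibROSA. File names are illustrative placeholders.
import subprocess
import librosa

# Transcode a WAV file to MP3 by invoking the ffmpeg command-line tool.
subprocess.run(["ffmpeg", "-y", "-i", "speech.wav", "speech.mp3"], check=True)

# Load the WAV file and compute MFCCs, a standard featurization for speech modeling.
y, sr = librosa.load("speech.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
</syntaxhighlight>

The resulting MFCC matrix could then be fed to a machine learning model or visualized as a spectrogram-like plot.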
 
==Applications==
Voice computing applications span many industries, including voice assistants, healthcare, e-commerce, finance, supply chain, agriculture, text-to-speech, security, marketing, customer support, recruiting, cloud computing, microphones, speakers, and podcasting. Voice technology is projected to grow at a CAGR of 19–25% by 2025, making it an attractive industry for startups and investors alike.<ref>{{Cite news |title=Global Speech and Voice Recognition Market 2018 Forecast to 2025 - CAGR Expected to Grow at 25.7% - ResearchAndMarkets.com |url=https://www.businesswire.com/news/home/20180417006122/en/Global-Speech-Voice-Recognition-Market-2018-Forecast |archive-url=https://web.archive.org/web/20240119171935/https://www.businesswire.com/news/home/20180417006122/en/Global-Speech-Voice-Recognition-Market-2018-Forecast |archive-date=2024-01-19 |access-date=2025-01-10 |language=en |url-status=live}}</ref>
 
{| class="wikitable"
|-
! UsePackage casename
! Description
! Example Product or Startup
|-
|! [[Voice assistantsFFmpeg]]
* <strong>[[FFmpeg]]</strong> -| for [[transcoding]] audio files from one format to another (e.g. .WAV --> .MP3).<ref>FFmpeg. https://www.ffmpeg.org/</ref>
| [[Cortana]],<ref>Cortana. https://www.microsoft.com/en-us/cortana</ref> [[Amazon Alexa]],<ref>Amazon Alexa. https://developer.amazon.com/alexa</ref> [[Siri]],<ref>Siri. https://www.apple.com/siri/</ref> [[Google Assistant]],<ref>Google Assistant. https://assistant.google.com/#?modal_active=none</ref> [[Apple HomePod]],<ref>HomePod. https://www.apple.com/homepod/</ref> [[Jasper]],<ref>Jasper https://jasperproject.github.io/</ref> and Nala.<ref>Nala. https://github.com/jim-schwoebel/nala</ref>
|-
! [[Audacity (audio editor)|Audacity]]
| [[Healthcare]]
* <strong>[[Audacity (audio editor)|Audacity]]</strong> - for recording and filtering audio.<ref>Audacity. https://www.audacityteam.org/</ref>
| Cardiocube,<ref>Cardiocube. https://www.cardiocube.com/</ref> Toneboard,<ref>Toneboard. https://toneboard.com/</ref> Suki,<ref>Suki. https://www.suki.ai/</ref> Praktice.ai,<ref>Praktice.ai. https://praktice.ai/</ref> Corti,<ref>Corti. https://corti.ai/</ref> and Syllable.<ref>Syllable. https://www.syllable.ai/</ref>
|-
|! [[e-CommerceSoX]]
* <strong>[[SoX]]</strong> -| for manipulating audio files and removing environmental noise.<ref>SoX. httphttps://sox.sourceforge.net/</ref>
| Cerebel,<ref>Cerebel. https://map.startuplithuania.lt/companies/cerebel</ref> Voysis,<ref>Voysis. https://voysis.com/</ref> Mindori,<ref>Mindori. http://mindori.com/</ref> Twiggle,<ref>Twiggle. https://www.twiggle.com/</ref> and Addstructure.<ref>AddStructure. https://www.crunchbase.com/organization/addstructure</ref>
|-
! [[Natural Language Toolkit]]
| [[Finance]]
* <strong>Natural Language ToolKit</strong> -| for featurizing transcripts with things like [[parts of speech]].<ref>NLTK. https://www.nltk.org/</ref>
| Kasisto,<ref>Kasisto. https://kasisto.com/</ref> Personetics,<ref>Personetics. https://personetics.com/</ref> Voxo,<ref>Voxo. https://www.voxo.ai/</ref> and Active Intelligence.<ref>Active Intelligence. https://active.ai/</ref>
|-
| [[Supply Chain]] and [[Manufacturing]]
| Augury,<ref>Augury. https://www.augury.com/</ref> Kextil,<ref>Kextil. http://www.kextil.com/</ref> 3DSignals,<ref>3DSignals. https://www.3dsig.com/</ref> Voxware,<ref>Voxware. https://www.voxware.com/</ref> and Otosense.<ref>Otosense. https://www.otosense.com/</ref>
|-
! LibROSA
| [[Agriculture]]
* <strong>LibROSA</strong> -| for visualizing audio file spectrograms and featurizing audio files.<ref>LibROSA. https://librosa.github.io/librosa/</ref>
| Agvoice.<ref>Agvoice. https://agvoiceglobal.com/</ref>
|-
| [[Text-to-speech]]
| Lyrebyrd <ref>Lyrebird. https://lyrebird.ai/</ref> and VocalID.<ref>VocalD. https://vocalid.ai/</ref>
|-
|! [[SecurityOpenSMILE]]
* <strong>[[OpenSMILE]]</strong> -| for featurizing audio files with things like mel-frequency cepstrum coefficients.<ref>OpenSMILE. https://www.audeering.com/technology/opensmile/</ref>
| Pindrop security <ref>Pindrop. https://www.pindrop.com/</ref> and Aimbrain.<ref>Aimbrain. https://aimbrain.com/</ref>
|-
| [[Marketing]]
| Convirza,<ref>Convirza. https://www.convirza.com/</ref> Dialogtech,<ref>Dialogtech. https://www.dialogtech.com/</ref> Invoca,<ref>Invoca. https://www.invoca.com/</ref> and Veritonic.<ref>Veritonic. https://veritonic.com/</ref>
|-
| [[Customer support]]
| Cogito.,<ref>Cogito. https://www.cogitocorp.com/</ref> Afiniti,<ref>Afiniti. https://www.afiniti.com/</ref> Aaron.ai,<ref>Aaron.ai. https://aaron.ai/</ref> Blueworx,<ref>Blueworx. https://www.blueworx.com/</ref> Servo.ai,<ref>Servo.ai. https://www.servo.ai/</ref> [[SmartAction]], and Chatdesk.<ref>Chatdesk. https://chatdesk.com/</ref>
|-
! [[CMU Sphinx]]
| [[Recruitment|Recruiting]]
| for transcribing speech files into text.<ref>{{Cite web | url=https://github.com/cmusphinx/pocketsphinx |title = PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop: Cmusphinx/Pocketsphinx|website = [[GitHub]]|date = 29 March 2020}}</ref>
| SurveyLex <ref>SurveyLex. https://www.surveylex.com/</ref> and Voice glance.<ref>Voice glance. https://voiceglance.com/</ref>
|-
| [[Speech-to-text]]
| Voicebase,<ref>Voicebase. https://www.voicebase.com/</ref> Speechmatics,<ref>Speechmatics. https://www.speechmatics.com/</ref> Capio,<ref>Capio. https://www.capio.ai/</ref> [[Nuance Communications|Nuance]], and Spitch.<ref>Spitch. https://www.spitch.ch/</ref>
|-
! Pyttsx3
| [[Cloud computing]]
* <strong>Pyttsx3</strong> -| for playing back audio files (text-to-speech).<ref>Pyttsx3. https://github.com/nateshmbhat/pyttsx3</ref>
| AWS,<ref>AWS. https://aws.amazon.com/</ref> GCP,<ref>GCP. https://cloud.google.com/</ref> IBM Watson,<ref>IBM Watson. https://www.ibm.com/watson/</ref> and Microsoft Azure.<ref>Microsoft Azure. https://azure.microsoft.com/en-us/</ref>
|-
! Pycryptodome
| [[Microphone]]/[[Loudspeaker|speaker]] design
* <strong>Pycryptodome</strong> -| for encrypting and decrypting audio files.<ref>Pycryptodome. https://pycryptodome.readthedocs.io/en/latest/</ref>
| Bose <ref>Bose speakers. https://www.bose.com/en_us/shop_all/speakers/speakers.html</ref> and Audio Technica.<ref>Audio Technica. https://www.audio-technica.com/cms/site/c35da94027e94819/index.html</ref>
|-
! AudioFlux
| [[Podcasting]]
| Anchorfor <ref>Anchor. https://anchor.fm/</ref>audio and iTunesmusic analysis, feature extraction.<ref>iTunesAudioFlux. https://www.applegithub.com/ituneslibAudioFlux/audioFlux/</ref>
|}
 
==Legal considerations==
In the United States, the states have varying [[telephone call recording laws]]. In some states, it is legal to record a conversation with the consent of only one party; in others, the consent of all parties is required.

Moreover, [[COPPA]] is a significant law protecting minors using the Internet. With an increasing number of minors interacting with voice computing devices (e.g. the Amazon Alexa), on October 23, 2017 the [[Federal Trade Commission]] relaxed the COPPA rule so that children can issue voice searches and commands.<ref>{{Cite web |last=Coldewey |first=Devin |date=2017-10-24 |title=FTC relaxes COPPA rule so kids can issue voice searches and commands |url=https://techcrunch.com/2017/10/24/ftc-relaxes-coppa-rule-so-kids-can-issue-voice-searches-and-commands/ |access-date=2025-01-10 |website=TechCrunch |language=en-US}}</ref><ref>{{cite web |url=https://www.federalregister.gov/documents/2017/12/08/2017-26509/enforcement-policy-statement-regarding-the-applicability-of-the-coppa-rule-to-the-collection-and-use |title=Federal Register :: Request Access |date=8 December 2017}}</ref>

Lastly, the [[GDPR]] is a European law that governs the [[right to be forgotten]] and many other protections for EU citizens. The GDPR also makes clear that companies need to outline measures to obtain consent if audio recordings are made, and to define the purpose and scope of how these recordings will be used (e.g., for training purposes). The bar for valid consent has been raised under the GDPR: consent must be freely given, specific, informed, and unambiguous; tacit consent is no longer sufficient.<ref>IAPP. https://iapp.org/news/a/how-do-the-rules-on-audio-recording-change-under-the-gdpr/</ref>

Together, these laws leave considerable uncertainty about how voice computing technology will be regulated in the future.
 
==Research conferences==
* [[International Conference on Acoustics, Speech, and Signal Processing]]
* Interspeech <ref>Interspeech 2018. http://interspeech2018.org/</ref>
* AVEC <ref>AVEC 2018. http://avec2018.org/</ref>
* IEEE Int'l Conf. on Automatic Face and Gesture Recognition <ref>2018 FG. https://fg2018.cse.sc.edu/ {{Webarchive|url=https://web.archive.org/web/20180511185841/https://fg2018.cse.sc.edu/ |date=2018-05-11 }}</ref>
* ACII 2019, the 8th Int'l Conf. on Affective Computing and Intelligent Interaction <ref>ACII 2019. http://acii-conf.org/2019/</ref>
 
==Developer community==
Google Assistant has roughly 2,000 actions as of January 2018.<ref>{{Cite web |last=Mutchler |first=Ava |date=2018-01-24 |title=Google Assistant App Total Reaches Nearly 2400 But That’s Not the Real Number. It’s really 1719. |url=https://voicebot.ai/2018/01/24/google-assistant-app-total-reaches-nearly-2400-thats-not-real-number-really-1719/ |access-date=2025-01-10 |website=Voicebot.ai |language=en-US}}</ref>
 
There are over 50,000 Alexa skills worldwide as of September 2018.<ref>{{Cite web |last=Kinsella |first=Bret |date=2018-09-02 |title=Amazon Alexa Now Has 50,000 Skills Worldwide, works with 20,000 Devices, Used by 3,500 Brands |url=https://voicebot.ai/2018/09/02/amazon-alexa-now-has-50000-skills-worldwide-is-on-20000-devices-used-by-3500-brands/ |access-date=2025-01-10 |website=Voicebot.ai |language=en-US}}</ref>
 
In June 2017, [[Google]] released AudioSet,<ref>Google AudioSet. https://research.google.com/audioset/</ref> a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. It contains 1,010,480 clips of human speech, or 2,793.5 hours in total.<ref>{{Cite web |title=AudioSet |url=https://research.google.com/audioset/dataset/speech.html |access-date=2025-01-10 |website=research.google.com}}</ref> It was released as part of the IEEE ICASSP 2017 Conference.<ref>Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., & Ritter, M. (2017, March). Audio Set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 776-780). IEEE.</ref>
 
In November 2017, the [[Mozilla Foundation]] released the Common Voice Project, a collection of speech files intended to contribute to the larger open source machine learning community.<ref>Common Voice Project. https://voice.mozilla.org/ {{Webarchive|url=https://web.archive.org/web/20200227020208/https://voice.mozilla.org/ |date=2020-02-27 }}</ref><ref>{{Cite web |title=Announcing the Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Dataset {{!}} The Mozilla Blog |url=https://blog.mozilla.org/en/mozilla/announcing-the-initial-release-of-mozillas-open-source-speech-recognition-model-and-voice-dataset/ |access-date=2025-01-10 |website=blog.mozilla.org |language=en-US}}</ref> The voicebank is currently 12 GB in size, with more than 500 hours of English-language voice data collected from 112 countries since the project's inception in June 2017.<ref>Mozilla's large repository of voice data will shape the future of machine learning. https://opensource.com/article/18/4/common-voice</ref> This dataset has already resulted in projects like DeepSpeech, an open source transcription model.<ref>DeepSpeech. https://github.com/mozilla/DeepSpeech</ref>
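As an illustrative sketch only, the following shows how the DeepSpeech Python package can transcribe a short recording. The model file name refers to a downloaded pre-trained checkpoint and is an assumption here, and the input WAV is assumed to be 16-bit, 16 kHz mono, which is what DeepSpeech's English models expect.

<syntaxhighlight lang="python">
# Sketch: transcribe a WAV file with Mozilla DeepSpeech.
# The model file name is an assumption (a downloaded DeepSpeech checkpoint),
# and the input audio is assumed to be 16-bit, 16 kHz, mono.
import wave
import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")  # pre-trained acoustic model

with wave.open("recording.wav", "rb") as wav_file:
    frames = wav_file.readframes(wav_file.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(model.stt(audio))  # prints the recognized text
</syntaxhighlight>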
 
==See also==
*[[Speech recognition]]
*[[Natural language processing]]
*[[Voice user interface]]
*[[Audio codec]]
*[[Ubiquitous computing]]
*[[Hands-free computing]]
 
[[Category:Computational linguistics]]
[[Category:Computational fields of study]]
[[Category:Artificial intelligence]]