{{Short description|Application programming interface for Microsoft Windows}}
The '''Speech Application Programming Interface''' or '''SAPI''' is an [[Application programming interface|API]] developed by [[Microsoft]] to allow the use of [[
In general, all versions of the API have been designed such that a software developer can write an application to perform
In general, the Speech API is a freely
There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5, however, was a completely new interface, released in 2000. Since then, several sub-versions of this API have been released.
==Basic architecture==
In SAPI 5, however, applications and engines do not directly communicate with each other. Instead, each
Typically, in SAPI 5, applications issue calls through the API (for example, to load a recognition grammar, start recognition, or provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, the loading of
In
*''API definition files'' - in [[MIDL]] and as C or C++ header files.
*''Runtime components'' - e.g. sapi.dll.
*''Text-To-Speech engines'' in multiple languages.
*''Speech Recognition engines'' in multiple languages.
*''Redistributable components'' to allow developers to package the engines and runtime with their [[application code]] to produce a single installable application.
*''Sample application code''.
*''Sample engines'' - implementations of the necessary engine interfaces but with no true speech
*''Documentation''.
==Versions==
[[Xuedong Huang]] was a key person who led Microsoft's early SAPI efforts.
===SAPI 1-4 API family===<!-- This section is linked from [[Speech synthesis]] -->
====SAPI 1====
The first version of SAPI was released in 1995, and was supported on [[Windows 95]] and [[Windows NT 3.51]]. This version included low-level Direct Speech Recognition and Direct Text To Speech APIs which
====SAPI 3====
SAPI 3.0 was released in 1997. It added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample
====SAPI 4====
SAPI 4.0 was released in 1998. This version of SAPI included the core [[Component Object Model|COM]] API, [[C++]] wrapper classes to make programming from C++ easier, and [[ActiveX]] controls to allow drag-and-drop [[Visual Basic]] development. This was shipped as part of an SDK that included recognition and synthesis engines. It also shipped
The main components of the SAPI 4 API
*'''Voice Command''' - high-level objects for command & control speech recognition
*'''Voice Dictation''' - high-level objects for continuous dictation speech recognition
*'''Audio objects''' - for reading to and from an audio device or file
===SAPI 5 API family===<!-- This section is linked from [[Speech synthesis]] -->
<!-- SAPI is not released or shipped with an SDK. It's part of Windows, the SDK just enables development-->
The '''Speech SDK version 5.0''', incorporating the '''SAPI 5.0''' runtime, was released in 2000. This was a complete redesign from previous versions, and neither engines nor applications that used older versions of SAPI could use the new version without considerable modification.
The design of the new API included the concept of strictly separating the application and engine so all calls were routed
The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages was added later. Operating systems from [[Windows 98]] and [[NT 4.0]] upwards were supported.
Major features of the API include:
*'''Shared Recognizer'''. For desktop speech recognition applications, a recognizer object can be used that runs in a separate process ('''sapisvr.exe'''). All applications using the shared recognizer communicate with this single instance. This allows sharing of resources, removes contention for the microphone and allows for a global UI for control of all speech
*'''In-proc recognizer'''. For
*'''Grammar objects'''. Speech grammars are used to specify the words that the recognizer is listening for. SAPI 5 defines an [[XML]] markup for specifying a grammar, as well as mechanisms to create them dynamically in code. Methods also exist for instructing the recognizer to load a built-in dictation language model.
*'''Voice object'''. This performs speech synthesis, producing an audio stream from a text. A markup language
*'''Audio interfaces'''. The runtime includes objects for performing speech input from the microphone or speech output to speakers
*'''User lexicon object'''. This allows custom words and pronunciations to be added by a user or application. These are added to the recognition or synthesis engine's built-in lexicons.
*'''Object tokens'''. This is a concept allowing recognition and TTS engines, audio objects, lexicons and other categories of object to be registered, enumerated and instantiated in a common way.
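The object-token idea in the last bullet can be illustrated with a small registry. The following is only a sketch in Python, not SAPI code: <code>ObjectToken</code>, <code>TokenRegistry</code>, <code>DemoVoice</code>, the token identifiers and the category names are all hypothetical, chosen to show how different kinds of object might be registered, enumerated and instantiated through one common mechanism.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ObjectToken:
    """Hypothetical stand-in for a token: an id, attributes, and a factory."""
    token_id: str
    attributes: Dict[str, str]
    factory: Callable[[], Any]

    def create_instance(self) -> Any:
        # Instantiate the object that this token describes.
        return self.factory()


@dataclass
class TokenRegistry:
    """Hypothetical registry mapping a category name to its tokens."""
    categories: Dict[str, List[ObjectToken]] = field(default_factory=dict)

    def register(self, category: str, token: ObjectToken) -> None:
        self.categories.setdefault(category, []).append(token)

    def enumerate_tokens(self, category: str, **required: str) -> List[ObjectToken]:
        # Return the tokens in a category whose attributes match every filter.
        return [
            token for token in self.categories.get(category, [])
            if all(token.attributes.get(k) == v for k, v in required.items())
        ]


class DemoVoice:
    """Placeholder object; a real voice would produce audio."""
    def __init__(self, name: str) -> None:
        self.name = name


registry = TokenRegistry()
registry.register("Voices", ObjectToken("voice.sam", {"Language": "en-US"}, lambda: DemoVoice("Sam")))
registry.register("Voices", ObjectToken("voice.lili", {"Language": "zh-CN"}, lambda: DemoVoice("Lili")))
registry.register("Recognizers", ObjectToken("reco.demo", {"Language": "en-US"}, lambda: object()))

# Enumerate one category with an attribute filter, then instantiate the match.
english_voices = registry.enumerate_tokens("Voices", Language="en-US")
voice = english_voices[0].create_instance()
```

The point of the design is that the calling code does not need to know how a voice, recognizer or lexicon is constructed; it only asks the registry for tokens in a category and creates an instance from the one it selects.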
====SAPI 5.0====
This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and [[Simplified chinese|Simplified Chinese]] versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.
====SAPI 5.1====
This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as [[JScript]], and [[managed code]]. This version of the API and the TTS engines shipped in [[Windows XP]].
====SAPI 5.2====
This was a special version of the API for use only in the [[Microsoft Speech Server]], which shipped in 2004. It added support for the [[Speech Recognition Grammar Specification|SRGS]] and [[Speech Synthesis Markup Language|SSML]] markup languages, as well as additional server features and performance improvements. The Speech Server also shipped with the version 6 desktop recognition engine and the version 7 server recognition engine.
====SAPI 5.3====
This is the version of the API that
* Support for W3C XML speech grammars for recognition and synthesis. The [[Speech Synthesis Markup Language]] (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation.
* The [[Speech Recognition Grammar Specification]] (SRGS) supports the definition of context-free grammars, with two limitations:
** It does not support the use of SRGS to specify dual-tone multi-frequency (touch-tone) grammars.
** It does not support [[Augmented Backus–Naur form]] (ABNF).
* Support for semantic interpretation script within grammars. SAPI 5.3 enables an SRGS grammar to be annotated with [[JavaScript]] for semantic interpretation to supplement the recognized text.
* User-specified shortcuts in lexicons: the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.
* Additional functionality and ease-of-programming provided by new types.
* Performance improvements, improved reliability, and security.
* Version 8 of the speech recognition engine ("Microsoft Speech Recognizer")
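As an illustration of the SSML 1.0 markup mentioned in the first bullet, the following Python sketch builds a small SSML document that sets speed, volume, pitch and emphasis. The element and attribute names (<code>speak</code>, <code>prosody</code>, <code>emphasis</code>) and the namespace come from the W3C SSML 1.0 specification; the helper name <code>build_ssml</code> and the sample text are hypothetical, and whether a particular SAPI engine accepts this exact document is an assumption that has not been verified here.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the W3C (SSML 1.0 and the reserved xml: prefix).
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, emphasized: str, lang: str = "en-US") -> str:
    """Hypothetical helper: return an SSML 1.0 document as a string."""
    # Serialize SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    # prosody controls speaking rate, volume and pitch.
    prosody = ET.SubElement(
        speak,
        f"{{{SSML_NS}}}prosody",
        {"rate": "slow", "volume": "loud", "pitch": "high"},
    )
    prosody.text = text + " "
    # emphasis marks the enclosed words as stressed.
    emphasis = ET.SubElement(prosody, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = emphasized
    return ET.tostring(speak, encoding="unicode")


document = build_ssml("The meeting starts at", "nine o'clock")
```

The resulting string could then be passed to whatever synthesis interface is in use; that step is left out because it depends on the platform and engine.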
====SAPI 5.4====
This is an updated version of the API that ships in [[Windows 7]].
===SAPI 5 Voices===
{{main|Microsoft text-to-speech voices}}
[[Microsoft text-to-speech voices#Windows 2000 and Windows XP|Microsoft Sam]] is a commonly shipped SAPI 5 voice. In addition, [[Microsoft Office XP]] and [[Office 2003]] installed [[Lernout & Hauspie|L&H]] Michael and Michelle voices. The SAPI 5.1 SDK installs three more voices: ''[[Microsoft text-to-speech voices#Windows 2000 and Windows XP|Mike]]'', ''[[Microsoft text-to-speech voices#Windows 2000 and Windows XP|Mary]]'', and an additional testing voice known as ''[[Microsoft text-to-speech voices#Windows 2000 and Windows XP|Sample TTS Voice]]'' that uses prerecorded audio instead of synthesized speech. [[Windows Vista]] and [[Windows 7|7]] include [[Microsoft text-to-speech voices#Windows Vista and Windows 7|Microsoft Anna]], which replaces Microsoft Sam and sounds more natural and intelligible; it is also installed on Windows XP by [[Microsoft Streets & Trips]] 2006 and later versions. The Chinese versions of Vista and 7 also include a female voice named [[Microsoft text-to-speech voices#Windows Vista and Windows 7|Microsoft Lili]]. [[Windows 8]] and later Windows client versions include [[Microsoft text-to-speech voices#Windows 8 and Windows 8.1|Microsoft David]], [[Microsoft text-to-speech voices#Windows 8 and Windows 8.1|Zira]], and [[Microsoft text-to-speech voices#Windows 8 and Windows 8.1|Hazel]], the last of which is included by default only on Windows 8 and [[Windows 8.1|8.1]]. These voices replaced Microsoft Anna and sound more natural and intelligible than previous voices.
===Managed code Speech API===
A [[managed code]] API ships as part of the [[.NET Framework 3.0]].<ref name="Speech synthesis and recognition in .NET - Give applications a voice">{{cite web
| author=Michael Dunn
| title=Speech synthesis and recognition in .NET - Give applications a voice
| publisher=Redmond Developer News
| url=http://reddevnews.com/articles/2007/02/15/give-applications-a-voice.aspx?sc_lang=en
| access-date=2011-11-09
}}
{{webarchive |url=https://web.archive.org/web/20100114122117/http://reddevnews.com/articles/2007/02/15/give-applications-a-voice.aspx |date=14 January 2010}}</ref> It has similar functionality to SAPI 5 but is better suited to managed code applications. The new API is available on [[Windows XP]], [[Windows Server 2003]], [[Windows Vista]], and [[Windows Server 2008]].
The existing SAPI 5 API can also be used from managed code to a limited extent by creating COM Interop code (helper code designed to assist in accessing COM interfaces and classes). This works well in some scenarios; however, the new API should provide a more seamless experience, equivalent to using any other managed code library.
However, a major obstacle to transitioning from COM Interop is that the managed implementation has subtle [[memory leak]]s, which lead to memory fragmentation and preclude use of the library in non-trivial applications. As a workaround, Microsoft has suggested using a different API, which has fewer voices.<ref>[http://connect.microsoft.com/VisualStudio/feedback/details/664196/system-speech-has-a-memory-leak System.Speech has a memory leak | Microsoft Connect]. Connect.microsoft.com. Retrieved on 2013-09-27.</ref>
==Speech functionality in Windows Vista==
{{see also|Windows Speech Recognition}}
[[Windows Vista]] includes a number of new speech-related features including:
* Speech control of the full Windows [[Graphical user interface|GUI]] and applications
* New tutorial, microphone wizard, and UI for controlling speech recognition
* New version of the Speech API runtime:
* Built-in updated Speech Recognition engine (Version 8)
* New Speech Synthesis engine and SAPI voice [[Microsoft Anna]]
* [[Managed code]] speech API (codenamed SpeechFX)
* Speech recognition support for 8 languages at release time: U.S. English, U.K. English, traditional Chinese, simplified Chinese, Japanese, Spanish, French, and German, with more languages to be released later.
[[Microsoft Agent]], most notably, and all other Microsoft speech applications use SAPI 5.
==Compatibility==
The Speech API is compatible with the following operating systems:<ref name="SAPI compatibility">{{cite web
| author=Microsoft Corporation
| title=SAPI System Requirements
| publisher=MSDN
| url=http://msdn.microsoft.com/library/default.asp?url=/library/en-us/SAPI51sr/html/system_requirements.asp
| access-date=2006-04-12
| archive-url=https://web.archive.org/web/20050822175601/http://msdn.microsoft.com/library/en-us/SAPI51sr/html/system_requirements.asp
| archive-date=2005-08-22
}}
</ref><ref name="SAPI5SysReq">{{Cite web|url=https://documentation.help/SAPI-5/system_requirements.htm|title=Welcome to the Microsoft Speech SDK - Microsoft Speech SDK Documentation|website=documentation.help|access-date=2025-06-20}}</ref>
===SAPI 5===
List as of SAPI version 5.1:<ref name="SAPI compatibility" /><ref name="SAPI5SysReq" />
*[[Windows Server 2003|Microsoft Windows Server 2003]]
*[[Windows XP|Microsoft Windows XP]] (Home Edition, Professional, etc.)
*[[Windows Me|Microsoft Windows Millennium Edition]]
*[[Windows 2000|Microsoft Windows 2000]]
*[[Windows 98|Microsoft Windows 98]]
*[[Windows NT 4.0|Microsoft Windows NT 4.0]], Service Pack 6a, in English, Japanese and Simplified Chinese.
Later versions of SAPI 5 (e.g. SAPI 5.3 and above) are compatible with the following operating systems:
*[[Windows Server|Microsoft Windows Server]] releases from [[Windows Server 2008|2008]] up to [[Windows Server 2025|2025]]
*[[Windows 11|Microsoft Windows 11]]
*[[Windows 10|Microsoft Windows 10]]
*[[Windows 8.1|Microsoft Windows 8.1]]
*[[Windows 8|Microsoft Windows 8]]
*[[Windows 7|Microsoft Windows 7]]
*[[Windows Vista|Microsoft Windows Vista]]
===SAPI 4===
*[[Windows Server 2003|Microsoft Windows Server 2003]] and later
*[[Windows XP|Microsoft Windows XP]] and later
*[[Windows Me|Microsoft Windows Millennium Edition]]
*[[Windows 2000|Microsoft Windows 2000]]
*[[Windows 98|Microsoft Windows 98]]
*[[Windows NT 4.0|Microsoft Windows NT 4.0]]
*[[Windows 95|Microsoft Windows 95]]
==Major applications using SAPI==
<!-- Please only add MAJOR applications where speech input or output is a major feature. SDKs and simple TTS support do not qualify -->
<!-- When adding an application, reference which features of SAPI are used, for example, TTS or Speech Recognition -->
<!-- Consider splitting this list into TTS and speech rec. sections, and having dictation and command & control subsections -->
*[[Windows XP Tablet PC Edition|Microsoft Windows XP Tablet PC Edition]] includes SAPI 5.1 and speech recognition engines 6.1 for English, Japanese, and Chinese (simplified and traditional)
*[[Windows Speech Recognition]] in [[Windows Vista]] and later
*[[Microsoft Narrator]] in Windows 2000 and later Windows operating systems
*[[Microsoft Office XP]] and [[Office 2003]]
*[[Microsoft Excel]] 2002, Microsoft Excel 2003, and Microsoft Excel 2007 for speaking spreadsheet data
*[[Microsoft Voice Command]] for Windows Pocket PC and Windows Mobile
*[[Microsoft Plus#Microsoft Plus! for Windows XP|Microsoft Plus! Voice Command for Windows Media Player]]
*[[Adobe Reader]] uses voice output to read document content
*[[CoolSpeech]], a text-to-speech application that reads text aloud from a variety of sources
*[[Window-Eyes]] screen reader
*[[JAWS (screen reader)|JAWS]] screen reader
*[[NonVisual Desktop Access]] (NVDA), a free and open source screen reader
==See also==
* [[Comparison of speech synthesizers]]
* [[List of speech recognition software]]
* {{annotated link|SASDK}}
==References==
{{reflist}}
==External links==
*[
*[https://web.archive.org/web/20071016060248/http://www.microsoft.com/speech/
*[http://www.microsoft.com/downloads/details.aspx?FamilyID=5e86ec97-40a7-453f-b0ee-6583171b4530 Microsoft download site for Speech API Software Developers Kit version 5.1]
*[http://www.microsoft.com/msj/archive/s233.aspx Microsoft Systems Journal Whitepaper by Mike Rozak on the first version of SAPI]
*[http://blogs.msdn.com/speech Microsoft Speech Team blog]
{{Speech synthesis}}
[[Category: Microsoft application programming interfaces]]
[[Category: Speech processing software]]
[[Category: Voice technology]]