: ''This is about the Speech API. For other meanings, see the disambiguation page [[SAPI]].''
 
The '''Speech Application Programming Interface''' or '''SAPI''' is an [[API]] developed by [[Microsoft]] to allow the use of [[Speech Recognition]] and [[Speech Synthesis]] within [[Microsoft Windows|Windows]] applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech [[SDK]] or as part of the Windows [[Operating System|OS]] itself. Applications that use SAPI include [[Microsoft Office]], [[Microsoft Agent]] and [[Microsoft Speech Server]].
 
In general, all versions of the API have been designed such that a software developer can write an application to perform Speech Recognition and Synthesis by using a standard set of interfaces, accessible from a variety of programming languages. In addition, it is possible for a third-party company to produce its own Speech Recognition and [[Speech Synthesis|Text-To-Speech]] engines or adapt existing engines to work with SAPI. In principle, as long as these engines conform to the defined interfaces, they can be used instead of the Microsoft-supplied engines.
 
Broadly, the versions released to date fall into two families: SAPI 1 through 4, and the substantially redesigned SAPI 5 (see the architecture section below). All versions have been made available free of charge.
 
==Basic Architecture==
 
Broadly, the Speech API can be viewed as an interface or piece of middleware which sits between ''applications'' and speech ''engines'' (recognition and synthesis). In SAPI versions 1 to 4, applications communicated directly with engines. The API was an abstract ''interface definition'' to which applications and engines conformed; in general, there was no runtime code between an application and an engine.
 
In SAPI 5, however, applications and engines do not communicate directly with each other; instead, each talks to a runtime component (''sapi.dll''). There is an API implemented by this component which applications use, and another set of interfaces for engines (the latter is sometimes referred to informally as the Device Driver Interface, or ''DDI'', although no part of a speech engine is actually a device driver).
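
For example, to synthesize speech an application asks the runtime for a voice object and passes it text, without ever referring to a particular engine. The following is a minimal sketch using the SAPI 5 C++ headers; the spoken string is arbitrary and error handling is kept to a check of the COM result codes.

<syntaxhighlight lang="cpp">
// Minimal SAPI 5 text-to-speech call: the application talks only to the
// runtime's SpVoice object; the runtime forwards the request to whichever
// synthesis engine is currently selected as the default.
#include <sapi.h>

int main()
{
    if (FAILED(::CoInitialize(NULL)))
        return 1;

    ISpVoice *pVoice = NULL;
    HRESULT hr = ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                    IID_ISpVoice, (void **)&pVoice);
    if (SUCCEEDED(hr))
    {
        // The text is handed to the runtime, which calls the TTS engine.
        pVoice->Speak(L"Hello from the Speech API", SPF_DEFAULT, NULL);
        pVoice->Release();
    }

    ::CoUninitialize();
    return 0;
}
</syntaxhighlight>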
 
Typically, a SAPI 5 application issues calls through the API (for example, to load a recognition grammar, to start recognition, or to provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, the loading of a grammar from a file is done in the runtime, but the grammar data is then passed to the recognition engine to use in recognition). The recognition and synthesis engines also generate events while processing (for example, to indicate that an utterance has been recognized or to indicate word boundaries in the synthesized speech). These pass in the reverse direction, from the engines, through the runtime DLL, to an ''event sink'' in the application.
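
A sketch of this pattern for recognition, assuming the SAPI 5 C++ headers and the shared desktop recognizer, is shown below: the application loads a dictation grammar, activates it, and waits for a recognition event to arrive at its event sink (here a simple Win32 event).

<syntaxhighlight lang="cpp">
// Sketch only: load a dictation grammar through the runtime, activate it,
// and block until the recognition engine reports a result back through the
// recognition context (the application's event sink). Assumes COM has
// already been initialized and omits most error handling.
#include <sapi.h>
#include <sphelper.h>   // CSpEvent helper class from the Speech SDK

void RecognizeOnce()
{
    ISpRecognizer  *pRecognizer = NULL;
    ISpRecoContext *pContext    = NULL;
    ISpRecoGrammar *pGrammar    = NULL;

    // Shared recognizer instance managed by the SAPI runtime.
    ::CoCreateInstance(CLSID_SpSharedRecognizer, NULL, CLSCTX_ALL,
                       IID_ISpRecognizer, (void **)&pRecognizer);
    pRecognizer->CreateRecoContext(&pContext);

    // Ask to be notified, via a Win32 event, about recognitions only.
    pContext->SetNotifyWin32Event();
    pContext->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

    // Load and activate a grammar; the runtime passes the grammar data
    // on to the recognition engine through the engine interfaces.
    pContext->CreateGrammar(0, &pGrammar);
    pGrammar->LoadDictation(NULL, SPLO_STATIC);
    pGrammar->SetDictationState(SPRS_ACTIVE);

    // Wait for the engine's recognition event to flow back to the application.
    pContext->WaitForNotifyEvent(INFINITE);
    CSpEvent evt;
    if (evt.GetFrom(pContext) == S_OK && evt.eEventId == SPEI_RECOGNITION)
    {
        // evt.RecoResult() exposes the recognized phrase.
    }

    pGrammar->Release();
    pContext->Release();
    pRecognizer->Release();
}
</syntaxhighlight>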
 
In addition to the actual API definition and runtime DLL, other components are shipped with all versions of SAPI to make a complete Speech [[Software Development Kit]]. The following components are among those included in most versions of the Speech SDK:
*''API definition files'' - in [[MIDL]] and as C or C++ header files.
*''Runtime components'' - e.g. sapi.dll.
*''Control Panel applet'' - to select and configure the default speech recognizer and synthesizer.
*''Text-To-Speech engines'' in multiple languages.
*''Speech Recognition engines'' in multiple languages.
*''Redistributable components'' to allow developers to package the engines and runtime with their applications, producing a single installable package.
*''Sample application code''.
*''Sample engines'' - implementations of the necessary engine interfaces, but with no true speech processing, which can serve as a starting point for those porting an engine to SAPI (see the outline below).
*''Documentation''.
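
As an illustration of what the engine side looks like, the outline below shows the kind of [[Component Object Model|COM]] class a Text-To-Speech engine (such as the sample engine) declares. The method signatures follow the SAPI 5 engine-side headers, but the class name is hypothetical and all of the actual synthesis code is omitted.

<syntaxhighlight lang="cpp">
// Outline of a SAPI 5 TTS engine object (the class name is hypothetical).
// The runtime, not the application, calls these methods through the
// engine-side interfaces ("DDI") declared in sapiddk.h.
#include <sapi.h>
#include <sapiddk.h>

class CSampleTtsEngine : public ISpTTSEngine, public ISpObjectWithToken
{
public:
    // Called by the runtime with the text fragments to synthesize; audio
    // and events are written back through the ISpTTSEngineSite.
    STDMETHODIMP Speak(DWORD dwSpeakFlags, REFGUID rguidFormatId,
                       const WAVEFORMATEX *pWaveFormatEx,
                       const SPVTEXTFRAG *pTextFragList,
                       ISpTTSEngineSite *pOutputSite);

    // Negotiates the audio format the engine will produce.
    STDMETHODIMP GetOutputFormat(const GUID *pTargetFormatId,
                                 const WAVEFORMATEX *pTargetWaveFormatEx,
                                 GUID *pOutputFormatId,
                                 WAVEFORMATEX **ppCoMemOutputWaveFormatEx);

    // ISpObjectWithToken: the runtime hands the engine the registry
    // "token" describing the installed voice.
    STDMETHODIMP SetObjectToken(ISpObjectToken *pToken);
    STDMETHODIMP GetObjectToken(ISpObjectToken **ppToken);

    // IUnknown methods and the synthesis implementation are omitted.
};
</syntaxhighlight>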