Microsoft Speech API

This is an old revision of this page, as edited by 59.93.102.107 (talk) at 12:54, 4 October 2006. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

hello hello of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5 however was a completely new interface, released in 2000. Since then several sub-versions of this API have been released.


Basic architecture

Broadly the Speech API can be viewed as

SAPI 2

SAPI 2.0 was released in 1996.

SAPI 3

SAPI 3.0 added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample apps and audio sources.

SAPI 4

SAPI 4.0 was released in 1998. This version of SAPI included both the core COM API; together with C++ wrapper classes to make programming from C++ easier; and ActiveX controls to allow drag-and-drop Visual Basic development. This was shipped as part of an SDK that included recognition and synthesis engines. It also shipped {with synthesis engines only} in Windows 2000.

The main components of the SAPI 4 API {which were all available in C++, COM, and ActiveX flavors} were:

  • Voice Command - high-level objects for command & control speech recognition
  • Voice Dictation - high-level objects for continuous dictation speech recognition
  • Voice Talk - high-level objects for speech synthesis
  • Voice Telephony - objects for writing telephone speech applications
  • Direct Speech Recognition - objects for direct control of recognition engine
  • Direct Text To Speech - objects for direct control of synthesis engine
  • Audio objects - for reading to and from an audio device or file


SAPI 5 API family

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime was released in 2000. This was a complete redesign from previous versions and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification.

The design of the new API included the concept of strictly separating the application and engine so all calls were routed though the runtime sapi.dll. This change was intended to make the API more 'engine-independent', preventing applications from inadvertently depending on features of a specific engine. In addition this change was aimed at making it much easier to incorporate speech technology into an application by moving some management and initialization code into the runtime.

The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages were added later. Operating systems from Windows 98 and NT 4.0 upwards were supported.

Major features of the API include:

  • Shared Recognizer. For desktop speech recognition applications, a recognizer object can be used that runs in a separate process (sapisvr.exe). All applications using the shared recognizer communicate with this single instance. This allows sharing of resources, removes contention for the microphone and allows for a global UI for control of all speech apps.
  • In-proc recognizer. For apps that require explicit control of the recognition process the in-proc recognizer object can be used instead of the shared one.
  • Grammar objects. Speech grammars are used to specify the words that the recognizer is listening for. SAPI 5 defines an XML markup for specifying a grammar, as well as mechanisms to create them dynamically in code. Methods also exist for instructing the recognizer to load a built-in dictation language model.
  • Voice object. This performs speech synthesis, producing an audio stream from text. A markup language {similar to XML, but not strictly XML} can be used for controlling the synthesis process.
  • Audio interfaces. The runtime includes objects for performing speech input from the microphone or speech output to speakers {or any sound device}; as well as to and from wave files. It is also possible to write a custom audio object to stream audio to or from a non-standard ___location.
  • User lexicon object. This allows custom words and pronunciations to be added by a user or application. These are added to the recognition or synthesis engine's built-in lexicons.
  • Object tokens. This is a concept allowing recognition and TTS engines, audio objects, lexicons and other categories of object to be registered, enumerated and instantiated in a common way.


SAPI 5.0

This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and Simplified chinese versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.

SAPI 5.1

This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as JScript, and managed code. This version of the API and TTS engines were shipped in [