Streaming speech recognition
Streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection. You can also get intermediate recognition results when the speaker has not yet finished the utterance. After a pause, SpeechKit returns the final results and starts recognizing the next utterance.
Voice assistants and smart speakers work using this recognition mode. When you activate the assistant, it starts transmitting speech to the server for recognition. The server processes the data and returns the intermediate and final recognition results of each utterance. The intermediate results are used to show the recognition progress. After the final results, the assistant performs an action, such as playing music or calling another person.
Warning
Streaming mode is designed for real-time audio recognition. To recognize a recorded audio file, use synchronous or asynchronous audio recognition.
Streaming recognition restrictions
SpeechKit streaming recognition has a number of restrictions that need to be taken into account when creating an application. For a full list of SpeechKit restrictions, see Quotas and limits in SpeechKit.
| | Streaming recognition |
|---|---|
| Use cases | Telephone assistants and robots, virtual assistants |
| Input data | Real-time voice |
| How it works | Exchanging messages with the server over a single connection |
| Supported APIs | gRPC v2, gRPC v3 |
| Maximum duration of audio data | 5 minutes |
| Maximum amount of transmitted data | 10 MB |
| Number of recognition channels | 1 |
Using the service
To use the service, create an application that will send audio fragments and process responses with recognition results.
Client application interface code
SpeechKit has two streaming recognition API versions: API v3 and API v2. We recommend using the API v3 for new projects.
For the application to access the service, clone the Yandex Cloud API repository and generate the client interface code for your programming language from it.
Client application examples:
- Audio file streaming recognition using the API v3
- Microphone speech streaming recognition using the API v3
- Example use of streaming recognition with API v2
See also the gRPC documentation
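As an illustration, here is a minimal Python sketch of connecting to the service, assuming the gRPC stubs were generated from the API v3 specifications in the Yandex Cloud API repository (the module and class names follow the generated code):

```python
import grpc

# Stubs generated from yandex/cloud/ai/stt/v3 in the Yandex Cloud API repository.
from yandex.cloud.ai.stt.v3 import stt_service_pb2_grpc

# Open a TLS connection to the public SpeechKit STT endpoint.
channel = grpc.secure_channel(
    'stt.api.cloud.yandex.net:443',
    grpc.ssl_channel_credentials(),
)
stub = stt_service_pb2_grpc.RecognizerStub(channel)
```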
Warning
When requesting the results of an operation, gRPC clients by default limit the maximum size of a message they accept as a response to 4 MB. If a response with recognition results exceeds this limit, an error is returned.
To get the entire response, increase the maximum message size limit:

- For Go, use the `MaxCallRecvMsgSize` function.
- For C++, set the `max_receive_message_size` value in the `call` method.
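In Python with grpcio, the same limit can be raised with a channel option. A sketch, setting the limit to 10 MB to match the maximum amount of data a streaming session can return:

```python
import grpc

# Raise the default 4 MB response limit to 10 MB.
channel = grpc.secure_channel(
    'stt.api.cloud.yandex.net:443',
    grpc.ssl_channel_credentials(),
    options=[('grpc.max_receive_message_length', 10 * 1024 * 1024)],
)
```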
Authentication with the service
In each request, the application must pass an IAM token or API key for authentication in the service, as well as the ID of the folder for which the account has the `ai.speechkit-stt.user` role or higher. For more information about permissions, see Access management.
The most straightforward way to authenticate an application is to use a service account. When authenticating as a service account, do not indicate the folder ID in your requests: SpeechKit will use the same folder where the service account was created.
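With gRPC, the credentials are passed in the request metadata. A minimal sketch in Python (the token values are placeholders):

```python
# Authenticate with an IAM token:
iam_token = '<IAM token>'  # placeholder
metadata = [('authorization', f'Bearer {iam_token}')]

# Or authenticate with an API key, for example, for a service account:
# metadata = [('authorization', 'Api-Key <API key>')]

# The metadata is then passed with every call:
# responses = stub.RecognizeStreaming(requests, metadata=metadata)
```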
Learn more about authentication in SpeechKit.
Recognition request
To recognize speech, the application must first send a message with recognition settings:
- For the API v3, the `StreamingRequest` message with the `session_options` type.
- For the API v2, the `StreamingRecognitionRequest` message with the `RecognitionConfig` type.
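For example, the first message of an API v3 session might be built like this (a Python sketch using the generated stubs; the audio format and language values are illustrative):

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Recognition settings: raw 16-bit PCM, 8 kHz, one channel, Russian.
session_options = stt_pb2.StreamingOptions(
    recognition_model=stt_pb2.RecognitionModelOptions(
        audio_format=stt_pb2.AudioFormatOptions(
            raw_audio=stt_pb2.RawAudio(
                audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                sample_rate_hertz=8000,
                audio_channel_count=1,
            )
        ),
        language_restriction=stt_pb2.LanguageRestrictionOptions(
            restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
            language_code=['ru-RU'],
        ),
    )
)

# The first message of the session carries the settings.
first_request = stt_pb2.StreamingRequest(session_options=session_options)
```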
When the session is set up, the server will wait for messages with audio fragments (chunks). Send the `StreamingRequest` message with the `chunk` type in the API v3 or the `StreamingRecognitionRequest` message with the `audio_content` type in the API v2. Take the following recommendations into account when sending messages:
- Do not send audio fragments too often or too rarely. The time between messages to the service should be approximately equal to the duration of the audio fragments you send, but no more than 5 seconds. For example, send 400 ms of audio for recognition every 400 ms.
- Maximum duration of transmitted audio for the entire session: 5 minutes.
- Maximum size of transmitted audio data: 10 MB.
If messages aren't sent to the service within 5 seconds or the data duration or size limit is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.
SpeechKit returns intermediate speech recognition results before the stream of messages with audio fragments has finished.
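A Python sketch of a request generator that follows these recommendations, continuing the settings example above (`session_options`) and assuming raw 16-bit PCM at 8 kHz:

```python
import time

from yandex.cloud.ai.stt.v3 import stt_pb2

CHUNK_MS = 400
SAMPLE_RATE = 8000
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def requests(audio_path):
    # The first message carries the recognition settings (see above).
    yield stt_pb2.StreamingRequest(session_options=session_options)

    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000
    with open(audio_path, 'rb') as f:
        while data := f.read(chunk_size):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))
            # Pace the stream: roughly one chunk duration between messages.
            time.sleep(CHUNK_MS / 1000)
```

The generator is passed directly to the streaming call, for example `stub.RecognizeStreaming(requests('speech.pcm'), metadata=metadata)`.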
Recognition result
In each recognition result message (`StreamingResponse` in the API v3 or `StreamingRecognitionResponse` in the API v2), the SpeechKit server returns one or more speech fragments (`chunks`) that it recognized during this period. A list of recognized text alternatives (`alternatives`) is specified for each speech fragment.
The SpeechKit server returns recognition results together with their type:

- `partial`: for intermediate results.
- `final`: for final results.
- `final_refinement`: for normalized final results.

With normalization enabled, you will get both the `final` and `final_refinement` results.
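A Python sketch of handling these result types in the API v3, continuing the sketches above (the oneof name `Event` and the field layout follow the generated stubs):

```python
responses = stub.RecognizeStreaming(requests('speech.pcm'), metadata=metadata)

for response in responses:
    event = response.WhichOneof('Event')
    if event == 'partial' and response.partial.alternatives:
        print('partial:', response.partial.alternatives[0].text)
    elif event == 'final':
        print('final:', response.final.alternatives[0].text)
    elif event == 'final_refinement':
        # The normalized text arrives in a nested alternatives list.
        print('normalized:',
              response.final_refinement.normalized_text.alternatives[0].text)
```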
In the API v2, if recognition is not yet complete, the results will include the `final` parameter set to `False`.
Speech recognition completes and final results are delivered upon EOU (End-of-Utterance), a marker of where an utterance ends. EOU occurs in the following cases:
- The gRPC session has terminated.
- Silence has been recognized in the last speech fragment. You can transmit silence using one of these two parameters:
  - `chunk`: a sound segment recognized as silence.
  - `silence_chunk`: silence duration in milliseconds. This parameter allows you to reduce the audio packet size by excluding silence that does not require recognition.
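For example, in the API v3 a two-second pause can be reported without sending the audio itself (a sketch; the `SilenceChunk` message and its `duration_ms` field follow the generated stubs):

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Report two seconds of silence instead of streaming ~32 KB of silent
# 8 kHz 16-bit PCM; send this from the request generator like any chunk.
silence = stt_pb2.StreamingRequest(
    silence_chunk=stt_pb2.SilenceChunk(duration_ms=2000)
)
```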