
Streaming speech recognition

Written by Yandex Cloud. Improved by amatol. Updated on April 30, 2025.
  • Streaming recognition restrictions
  • Using the service
    • Client application interface code
    • Authentication with the service
    • Recognition request
    • Recognition result
  • Use cases

Streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection. You can also get intermediate recognition results when the speaker has not yet finished the utterance. After a pause, SpeechKit returns the final results and starts recognizing the next utterance.

Voice assistants and smart speakers work using this recognition mode. When you activate the assistant, it starts transmitting speech to the server for recognition. The server processes the data and returns the intermediate and final recognition results of each utterance. The intermediate results are used to show the recognition progress. After the final results, the assistant performs an action, such as playing music or calling another person.

Warning

Streaming mode is designed for real-time audio recognition. To recognize a recorded audio file, use synchronous or asynchronous audio recognition.

Streaming recognition restrictions

SpeechKit streaming recognition has a number of restrictions that need to be taken into account when creating an application. For a full list of SpeechKit restrictions, see Quotas and limits in SpeechKit.

Streaming recognition:

  • Use cases: phone assistants and robots, virtual assistants
  • Input data: real-time voice
  • How it works: exchanging messages with the server over a single connection
  • Supported APIs: gRPC v2, gRPC v3
  • Maximum duration of audio data: 5 minutes
  • Maximum amount of transmitted data: 10 MB
  • Number of recognition channels: 1

Using the service

To use the service, create an application that will send audio fragments and process responses with recognition results.

Client application interface code

SpeechKit has two streaming recognition API versions: API v3 and API v2. We recommend using the API v3 for new projects.

For the application to access the service, clone the Yandex Cloud API repository and generate the client interface code for your programming language from the API v2 or API v3 specification files.

Client application examples:

  • Audio file streaming recognition using the API v3.
  • Microphone speech streaming recognition using the API v3.
  • Example use of streaming recognition with API v2.

See also the gRPC documentation for detailed instructions on how to generate interfaces and implement client apps in various programming languages.
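If you work in Python, stub generation can be scripted with grpcio-tools. Below is a minimal sketch; the repository layout, output directory, and the exact set of dependency .proto files are assumptions based on a typical clone of the cloudapi repository, so adjust the paths to your setup.

```python
# A sketch of generating Python stubs from the cloned cloudapi repository.
# Requires: pip install grpcio-tools
from grpc_tools import protoc

protoc.main([
    "grpc_tools.protoc",
    "-I", ".",                       # run from the cloudapi repository root
    "-I", "third_party/googleapis",  # assumed location of Google API protos
    "--python_out=output",
    "--grpc_python_out=output",
    # The API v3 service definition plus its message file; depending on the
    # repository version, you may also need to compile shared dependencies
    # such as yandex/cloud/validation.proto.
    "yandex/cloud/ai/stt/v3/stt_service.proto",
    "yandex/cloud/ai/stt/v3/stt.proto",
])
```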

Warning

By default, gRPC clients limit the maximum size of a message they can accept as a response to 4 MB. If a response with recognition results exceeds this limit, an error is returned.

To get the entire response, increase the maximum message size limit:

  • For Go, use the MaxCallRecvMsgSize function.
  • For C++, in the call method, set the max_receive_message_size value.
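The same limit exists in other gRPC clients. For example, in Python (not listed above) the cap can be raised with a standard channel option; this is a minimal sketch, and the 16 MB value is just an illustration:

```python
import grpc

# Raise the default 4 MB cap on response messages for this channel.
# "grpc.max_receive_message_length" is a standard gRPC channel option;
# use -1 to remove the limit entirely.
channel = grpc.secure_channel(
    "stt.api.cloud.yandex.net:443",
    grpc.ssl_channel_credentials(),
    options=[("grpc.max_receive_message_length", 16 * 1024 * 1024)],
)
```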

Authentication with the service

In each request, the application must pass an IAM token or API key to authenticate with the service, along with the ID of a folder in which the account has the ai.speechkit-stt.user role or higher. For more information about permissions, see Access management.

The most straightforward way to authenticate an application is with a service account. When authenticating as a service account, do not specify the folder ID in your requests: SpeechKit will use the folder where the service account was created.

Learn more about authentication in SpeechKit.
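For a Python gRPC client, credentials are passed as call metadata. A minimal sketch, assuming a service account with an API key (the authorization header formats are those accepted by SpeechKit; the x-folder-id header applies to the API v3):

```python
import grpc

API_KEY = "<service account API key>"  # placeholder

channel = grpc.secure_channel("stt.api.cloud.yandex.net:443",
                              grpc.ssl_channel_credentials())
# Service account with an API key:
metadata = (("authorization", f"Api-Key {API_KEY}"),)
# User account with an IAM token (pass the folder ID explicitly):
# metadata = (("authorization", f"Bearer {IAM_TOKEN}"),
#             ("x-folder-id", FOLDER_ID))
```

Pass this metadata with every call, for example stub.RecognizeStreaming(requests, metadata=metadata).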

Recognition request

For speech recognition, the application must first send a recognition settings message:

  • For the API v3, the RecognizeStreaming message of the session_options type.
  • For the API v2, the StreamingRecognitionRequest message of the RecognitionConfig type.

When the session is set up, the server will wait for messages with audio fragments (chunks). Send the RecognizeStreaming message of the chunk type to the API v3 or the StreamingRecognitionRequest message of the audio_content type to the API v2. Take the following recommendations into account when sending messages:

  • Do not send audio fragments too frequently or too rarely. The time between messages to the service should be approximately the same as the duration of the audio fragments you send, but no more than 5 seconds. For example, send 400 ms of audio for recognition every 400 ms.
  • Maximum duration of transmitted audio for the entire session: 5 minutes.
  • Maximum size of transmitted audio data: 10 MB.

If no messages are sent to the service for 5 seconds, or the audio duration or size limit is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.

SpeechKit can return intermediate recognition results before the stream of audio messages has finished.
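Putting the sequence together for the API v3 in Python: the sketch below sends the session options first and then streams a PCM file in chunks. Module names assume stubs generated from the cloudapi repository as shown earlier; the audio format values are illustrative.

```python
import grpc
import yandex.cloud.ai.stt.v3.stt_pb2 as stt_pb2
import yandex.cloud.ai.stt.v3.stt_service_pb2_grpc as stt_service_pb2_grpc

API_KEY = "<service account API key>"  # placeholder
CHUNK_SIZE = 4000  # bytes; ~250 ms of 8 kHz 16-bit mono PCM

def requests(audio_path):
    # First message: recognition settings for the session.
    yield stt_pb2.StreamingRequest(
        session_options=stt_pb2.StreamingOptions(
            recognition_model=stt_pb2.RecognitionModelOptions(
                audio_format=stt_pb2.AudioFormatOptions(
                    raw_audio=stt_pb2.RawAudio(
                        audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                        sample_rate_hertz=8000,
                        audio_channel_count=1,
                    )
                )
            )
        )
    )
    # Subsequent messages: audio fragments. In a real-time application,
    # pace these sends to match the audio duration (e.g., 400 ms of audio
    # every 400 ms) and stay under the 5-second gap limit.
    with open(audio_path, "rb") as f:
        while data := f.read(CHUNK_SIZE):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))

channel = grpc.secure_channel("stt.api.cloud.yandex.net:443",
                              grpc.ssl_channel_credentials())
stub = stt_service_pb2_grpc.RecognizerStub(channel)
responses = stub.RecognizeStreaming(
    requests("speech.pcm"),
    metadata=(("authorization", f"Api-Key {API_KEY}"),),
)
```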

Recognition result

In each recognition result message (StreamingResponse in the API v3 or StreamingRecognitionResponse in the API v2), the SpeechKit server returns one or more speech fragments it has recognized so far (chunks). Each fragment comes with a list of recognized text alternatives (alternatives).

The SpeechKit server returns recognition results together with their type:

  • partial: For intermediate results.

  • final: For final results.

  • final_refinement: For normalized final results.

    With normalization enabled, you will get both the final and final_refinement results.

In the API v2, if recognition is not yet complete, the results will contain the final parameter set to False.
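Continuing the Python sketch above for the API v3: each StreamingResponse carries one event, and WhichOneof shows which result type arrived.

```python
# Iterate over server responses from stub.RecognizeStreaming(...).
for response in responses:
    event = response.WhichOneof("Event")
    if event == "partial" and response.partial.alternatives:
        print("partial:", response.partial.alternatives[0].text)
    elif event == "final":
        print("final:", response.final.alternatives[0].text)
    elif event == "final_refinement":
        # Normalized text of the final result.
        print("refined:",
              response.final_refinement.normalized_text.alternatives[0].text)
```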

Speech recognition completes and final results are delivered upon end of utterance (EOU), a marker of where an utterance ends. EOU occurs in the following cases:

  • The gRPC session is terminated.

  • Silence has been recognized in the last speech fragment. Silence can be represented by one of these two parameters:

    • chunk: Sound recognized as silence.
    • silence_chunk: Silence duration in milliseconds. This parameter allows you to reduce the audio packet size by excluding silence that does not require recognition (see the sketch after this list).
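For the API v3 in Python, a silence interval can be declared inside the request iterator from the earlier sketch instead of streaming the silent audio itself (a fragment; the 2000 ms value is illustrative):

```python
# Inside the request iterator: report 2 seconds of silence without
# sending the audio bytes themselves.
yield stt_pb2.StreamingRequest(
    silence_chunk=stt_pb2.SilenceChunk(duration_ms=2000)
)
```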

Use cases

  • Audio file streaming recognition using the API v3
  • Microphone speech streaming recognition using the API v3
  • Streaming speech recognition with auto language detection in the API v3
  • Example use of streaming recognition with API v2

See also

  • Supported audio formats
  • Supported languages and recognition models
  • Authentication with the SpeechKit API
  • API v2 for streaming recognition
  • API v3 reference
