Streaming speech recognition
Streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection. You can also get intermediate recognition results when the speaker has not yet finished the utterance. After a pause, SpeechKit returns the final results and starts recognizing the next utterance.
Voice assistants and smart speakers work using this recognition mode. When you activate the assistant, it starts transmitting speech to the server for recognition. The server processes the data and returns the intermediate and final recognition results of each utterance. The intermediate results are used to show the recognition progress. After the final results, the assistant performs an action, such as playing music or calling another person.
Warning
Streaming mode is designed for real-time audio recognition. To recognize a recorded audio file, use synchronous or asynchronous audio recognition.
Streaming recognition restrictions
SpeechKit streaming recognition has a number of restrictions that need to be taken into account when creating an application. For a full list of SpeechKit restrictions, see Quotas and limits in SpeechKit.
| | Streaming recognition |
|---|---|
| Use cases | Telephone assistants and robots, virtual assistants |
| Input data | Real-time voice |
| How it works | Exchanging messages with the server over a single connection |
| Supported APIs | gRPC v2, gRPC v3 |
| Maximum duration of audio data | 5 minutes |
| Maximum amount of transmitted data | 10 MB |
| Number of recognition channels | 1 |
Using the service
To use the service, create an application that will send audio fragments and process responses with recognition results.
Client application interface code
SpeechKit has two streaming recognition API versions: API v3 and API v2. We recommend using the API v3 for new projects.
For the application to access the service, clone the Yandex Cloud API repository and generate the client interface code for your programming language from it.
Client application examples:
- Audio file streaming recognition using the API v3
- Microphone speech streaming recognition using the API v3
- Example use of streaming recognition with API v2
See also the gRPC documentation
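As an illustration, here is a minimal Python sketch of connecting to the service, assuming the gRPC stubs were generated from the API v3 specifications in the Yandex Cloud API repository (the module and class names follow the generated code):

```python
import grpc

# Stubs generated from yandex/cloud/ai/stt/v3 in the Yandex Cloud API repository.
from yandex.cloud.ai.stt.v3 import stt_service_pb2_grpc

# Open a TLS connection to the public SpeechKit STT endpoint.
channel = grpc.secure_channel(
    'stt.api.cloud.yandex.net:443',
    grpc.ssl_channel_credentials(),
)
stub = stt_service_pb2_grpc.RecognizerStub(channel)
```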
Warning
When requesting the results of an operation, gRPC clients by default limit the maximum size of a message they accept as a response to 4 MB. If a response with recognition results exceeds this limit, an error is returned.
To get the entire response, increase the maximum message size limit:

- For Go, use the `MaxCallRecvMsgSize` function.
- For C++, set the `max_receive_message_size` value in the `call` method.
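In Python with grpcio, the same limit can be raised with a channel option. A sketch, setting the limit to 10 MB to match the maximum amount of data a streaming session can return:

```python
import grpc

# Raise the default 4 MB response limit to 10 MB.
channel = grpc.secure_channel(
    'stt.api.cloud.yandex.net:443',
    grpc.ssl_channel_credentials(),
    options=[('grpc.max_receive_message_length', 10 * 1024 * 1024)],
)
```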
Authentication with the service
In each request, the application must pass an IAM token or API key for authentication in the service, as well as the ID of the folder for which the account has the `ai.speechkit-stt.user` role or higher. For more information about permissions, see Access management.
The most straightforward way to authenticate an application is to use a service account. When authenticating as a service account, do not indicate the folder ID in your requests: SpeechKit will use the same folder where the service account was created.
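With gRPC, the credentials are passed in the request metadata. A minimal sketch in Python (the token values are placeholders):

```python
# Authenticate with an IAM token:
iam_token = '<IAM token>'  # placeholder
metadata = [('authorization', f'Bearer {iam_token}')]

# Or authenticate with an API key, for example, for a service account:
# metadata = [('authorization', 'Api-Key <API key>')]

# The metadata is then passed with every call:
# responses = stub.RecognizeStreaming(requests, metadata=metadata)
```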
Learn more about authentication in SpeechKit.
Recognition request
To recognize speech, the application must first send a message with recognition settings:
- For the API v3, the `StreamingRequest` message with the `session_options` type.
- For the API v2, the `StreamingRecognitionRequest` message with the `RecognitionConfig` type.
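For example, the first message of an API v3 session might be built like this (a Python sketch using the generated stubs; the audio format and language values are illustrative):

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Recognition settings: raw 16-bit PCM, 8 kHz, one channel, Russian.
session_options = stt_pb2.StreamingOptions(
    recognition_model=stt_pb2.RecognitionModelOptions(
        audio_format=stt_pb2.AudioFormatOptions(
            raw_audio=stt_pb2.RawAudio(
                audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                sample_rate_hertz=8000,
                audio_channel_count=1,
            )
        ),
        language_restriction=stt_pb2.LanguageRestrictionOptions(
            restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
            language_code=['ru-RU'],
        ),
    )
)

# The first message of the session carries the settings.
first_request = stt_pb2.StreamingRequest(session_options=session_options)
```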
When the session is set up, the server will wait for messages with audio fragments (chunks). Send the `StreamingRequest` message with the `chunk` type in the API v3 or the `StreamingRecognitionRequest` message with the `audio_content` type in the API v2. Take the following recommendations into account when sending messages:
- Do not send audio fragments too often or too rarely. The time between messages to the service should be approximately equal to the duration of the audio fragments you send, but no more than 5 seconds. For example, send 400 ms of audio for recognition every 400 ms.
- Maximum duration of transmitted audio for the entire session: 5 minutes.
- Maximum size of transmitted audio data: 10 MB.
If messages aren't sent to the service within 5 seconds or the data duration or size limit is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.
SpeechKit returns intermediate speech recognition results before the stream of messages with audio fragments has finished.
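A Python sketch of a request generator that follows these recommendations, continuing the settings example above (`session_options`) and assuming raw 16-bit PCM at 8 kHz:

```python
import time

from yandex.cloud.ai.stt.v3 import stt_pb2

CHUNK_MS = 400
SAMPLE_RATE = 8000
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def requests(audio_path):
    # The first message carries the recognition settings (see above).
    yield stt_pb2.StreamingRequest(session_options=session_options)

    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000
    with open(audio_path, 'rb') as f:
        while data := f.read(chunk_size):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))
            # Pace the stream: roughly one chunk duration between messages.
            time.sleep(CHUNK_MS / 1000)
```

The generator is passed directly to the streaming call, for example `stub.RecognizeStreaming(requests('speech.pcm'), metadata=metadata)`.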
Recognition result
In each recognition result message (`StreamingResponse` in the API v3 or `StreamingRecognitionResponse` in the API v2), the SpeechKit server returns one or more speech fragments (`chunks`) that it recognized during this period. A list of recognized text alternatives (`alternatives`) is specified for each speech fragment.
The SpeechKit server returns recognition results together with their type:

- `partial`: for intermediate results.
- `final`: for final results.
- `final_refinement`: for normalized final results.

With normalization enabled, you will get both the `final` and `final_refinement` results.
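A Python sketch of handling these result types in the API v3, continuing the sketches above (the oneof name `Event` and the field layout follow the generated stubs):

```python
responses = stub.RecognizeStreaming(requests('speech.pcm'), metadata=metadata)

for response in responses:
    event = response.WhichOneof('Event')
    if event == 'partial' and response.partial.alternatives:
        print('partial:', response.partial.alternatives[0].text)
    elif event == 'final':
        print('final:', response.final.alternatives[0].text)
    elif event == 'final_refinement':
        # The normalized text arrives in a nested alternatives list.
        print('normalized:',
              response.final_refinement.normalized_text.alternatives[0].text)
```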
In the API v2, if recognition is not yet complete, the results will include the `final` parameter set to `False`.
Speech recognition completes and final results are delivered upon EOU (End-of-Utterance), a marker of where an utterance ends. EOU occurs in the following cases:
- The gRPC session has terminated.
- Silence has been recognized in the last speech fragment. You can transmit silence using one of these two parameters:
  - `chunk`: a sound segment recognized as silence.
  - `silence_chunk`: silence duration in milliseconds. This parameter allows you to reduce the audio packet size by excluding silence that does not require recognition.
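For example, in the API v3 a two-second pause can be reported without sending the audio itself (a sketch; the `SilenceChunk` message and its `duration_ms` field follow the generated stubs):

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Report two seconds of silence instead of streaming ~32 KB of silent
# 8 kHz 16-bit PCM; send this from the request generator like any chunk.
silence = stt_pb2.StreamingRequest(
    silence_chunk=stt_pb2.SilenceChunk(duration_ms=2000)
)
```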