FAQ about SpeechKit

Written by

Updated at June 20, 2025

General questions about
Speech recognition (STT)
Speech synthesis TTS

General questions about

Can I get logs of my operations in Yandex Cloud?

Yes, you can request information about operations with your resources from Yandex Cloud logs. For more information, see Data requests.

Where can I view API request statistics?

You can view the statistics in the management console under SpeechKit → Monitoring.

As you use the service, your balance automatically decreases. Learn more about paying for resources in Yandex Cloud.

How do I increase quotas? / What should I do if the "429 Too Many Requests" error occurs?

The 429 Too Many Requests error indicates that the quotas set for your folder have been exceeded. The default quota values are specified in this table.

To increase the quotas, contact support. In your request, specify the folder ID and the required quotas in "Name — Value" format.

What is the purpose of the folder ID (folderId)?

SpeechKit uses folderId for authorization (access permission verification) and payments for resources.

When sending a request under a service account, you do not need to specify folderId: by default, the account's home folder ID is used. If you specify a different folder, the service will return an error.

If logged in under a user account, you must specify folderId.

Depending on the API used, you should include the folder ID in the request header or body. For more information about authentication in SpeechKit, see Authentication with the SpeechKit API.

How do I generate an IAM token?

Among other methods, you can generate an IAM token using the Yandex Cloud command line interface. You can find all the possible ways of obtaining an IAM token for your Yandex account, federated account, and service account in the Yandex Identity and Access Management documentation.

Even though IAM tokens are only valid for 12 hours, we still recommend them as the most secure authentication method for SpeechKit.

What should I do if I get the "401 Unauthorized" error?

If the SpeechKit server returns the 401 Unauthorized error in response to a request, check if your account has the required roles. In addition, check your authorization method and the key or token you are using for authorization: you may have obtained it for a different account than the one you are using for your request.

If using a service account, do not specify folder ID in your requests: SpeechKit uses the service account's home folder.

Under what conditions can I use SpeechKit performance results?

We do not impose any restrictions on the use of SpeechKit performance results. You can use them at your discretion under the Russian law.
Feel free to use the demo stand. Please note that certain restrictions apply.
You can work with SpeechKit via the API or Playground web interface.

For terms of use, see the Yandex SpeechKit Terms of Use.

Speech recognition (STT)

Incorrect stress and pronunciation

Create a request and attach examples so that developers can make adjustments to the next releases of the speech synthesis model.

Poor speech recognition quality at 8kHz

If the issue is systematic (tens of percent of the total number of speech recognition requests), submit a request and attach examples for analysis. The more examples you send, the more likely the developers will discover the bug.

Feedback form on speech recognition quality

If you have any issues, please contact support and provide files and description.

Two channels were recognized as one / How to recognize each channel separately

You can recognize multi-channel audio files only using asynchronous recognition.

Check the format of your recording:

For LPCM, the config.specification.audioChannelCount parameter should equal 2.
Do not specify this parameter for MP3 and OggOpus, since the number of channels is already stated in the file. The file will be automatically split into the appropriate number of recordings.

The recognized text in the response is separated by the channelTag parameter.

Is it possible to recognize two or more voices separated by speaker?

You can recognize multi-channel audio files only using asynchronous recognition.

During speech recognition, text is not split by voice, but you can place the voices in different channels and separate the recognized text in the response with the channelTag parameter.

The number of channels can be specified in a request using the config.specification.audioChannelCount parameter.

Incomplete audio recognition

If recognizing streaming audio, try using different API versions: API v2 or API v3.

To recognize an audio file, try different models.

The file doesn't exceed the limit, but an error occurs during recognition

If the file is multi-channel, take into account the total recording time of all channels. For the full list of limitations, see Quotas and limits in SpeechKit.

Internal Server Error

Make sure the format you specified in the request matches the actual file format. If the error persists, send us examples of your audio files that cannot be recognized.

When is a response sent during recognition?

Under synchronous and asynchronous recognition, a response is sent once: after processing the request.

In streaming recognition mode, you can configure the server behavior. By default, the server returns a response only after the received utterance is fully recognized. You can use the partialResults parameter to set up recognition so that the server also returns intermediate recognition results.

Intermediate results allow you to quickly respond to the recognized speech without waiting for the end of the utterance.

Where can I find an example of audio file recognition?

For SpeechKit usage examples, see Tutorials. To recognize pre-recorded audio files, use asynchronous recognition.

Where can I find an example of microphone speech recognition?

Example of streaming recognition of microphone-recorded speech.

Can I use POST for streaming recognition?

Streaming recognition uses gRPC and is not supported by the REST API, so you cannot use the POST method.

A streaming recognition session is broken/terminated

When using the API v2 for streaming recognition, the service awaits audio data. If it does not receive any data within 5 seconds, the session terminates. You cannot change this parameter in the API v2.

Streaming recognition runs in real time. You can send "silence" for recognition so that the service does not terminate the connection.

We recommend using the API v3 for streaming recognition. The API v3 features a special message type for sending "silence", so you will not have to simulate it yourself in your audio recording.

How does the service figure out the end of an utterance and the duration of a recognition session?

The end of an utterance is determined automatically by the "silence" after the utterance. For more information about end-of-utterance detection, see Detecting the end of utterance.

The maximum session duration for streaming recognition is 5 minutes.

What should I do if SpeechKit does not listen to a conversation to the end or, conversely, it takes too long to wait until it ends?

Interruptions or delays during streaming recognition may occur due to detecting the end of utterance (EOU). For EOU setup recommendations, see Detecting the end of utterance.

Error: OutOfRange desc = Exceeded maximum allowed stream duration

This error means that the maximum allowed duration of a recognition session has been exceeded. In this case, you need to reopen the session.

For streaming recognition, the maximum session duration is 5 minutes. This is a technical limitation due to the Yandex Cloud architecture and it cannot be changed.

What goes into the usage cost?

For examples of calculating the usage cost, pricing rules, and effective prices, see SpeechKit pricing policy.

Speech synthesis TTS

How can I voice long texts?

To voice a large text, break it into parts in any way convenient for you. The maximum size of speech synthesis requests is limited to 5,000 characters.

How do I configure stress and pronunciation?

To adjust the pronunciation of individual words and the text in general, use the SSML or TTS markup.

How do I add a pause in text?

To add a pause to your text, use TTS markup. Specify the pause duration in milliseconds in parentheses. A pause will appear where you place the tag. Here is an example: Start sil<[3000]> continue in 3 seconds. The maximum value is 7,000.

Note that the pause is indicated as a recommendation only. SpeechKit is designed to generate natural speech.

A cURL request does not work in Windows PowerShell

In the Windows PowerShell terminal, the curl command is an alias for the Invoke-WebRequest system call.

The Yandex Cloud documentation provides examples of API calls using the Bash shell syntax. You can run them as is in the Linux console, macOS terminal, or WSL in Windows 10 or higher. To run the examples in Windows PowerShell, you will have to modify them yourself. For more on command equivalents between Bash and PowerShell, as well as tips, see Working with the Yandex Cloud CLI and API in Microsoft Windows.

What goes into the cost of synthesis?

For examples of calculating the usage cost, pricing rules, and effective prices, see SpeechKit pricing policy.

FAQ about SpeechKit

General questions aboutGeneral questions about

Can I get logs of my operations in Yandex Cloud?Can I get logs of my operations in Yandex Cloud?

Where can I view API request statistics?Where can I view API request statistics?

How do I increase quotas? / What should I do if the "429 Too Many Requests" error occurs?How do I increase quotas? / What should I do if the "429 Too Many Requests" error occurs?

What is the purpose of the folder ID (folderId)?What is the purpose of the folder ID (folderId)?

How do I generate an IAM token?How do I generate an IAM token?

What should I do if I get the "401 Unauthorized" error?What should I do if I get the "401 Unauthorized" error?

Under what conditions can I use SpeechKit performance results?Under what conditions can I use SpeechKit performance results?

Speech recognition (STT)Speech recognition (STT)

Incorrect stress and pronunciationIncorrect stress and pronunciation

Poor speech recognition quality at 8kHzPoor speech recognition quality at 8kHz

Feedback form on speech recognition qualityFeedback form on speech recognition quality

Two channels were recognized as one / How to recognize each channel separatelyTwo channels were recognized as one / How to recognize each channel separately

Is it possible to recognize two or more voices separated by speaker?Is it possible to recognize two or more voices separated by speaker?

Incomplete audio recognitionIncomplete audio recognition

The file doesn't exceed the limit, but an error occurs during recognitionThe file doesn't exceed the limit, but an error occurs during recognition

Internal Server ErrorInternal Server Error

When is a response sent during recognition?When is a response sent during recognition?

Where can I find an example of audio file recognition?Where can I find an example of audio file recognition?

Where can I find an example of microphone speech recognition?Where can I find an example of microphone speech recognition?

Can I use POST for streaming recognition?Can I use POST for streaming recognition?

A streaming recognition session is broken/terminatedA streaming recognition session is broken/terminated

How does the service figure out the end of an utterance and the duration of a recognition session?How does the service figure out the end of an utterance and the duration of a recognition session?

What should I do if SpeechKit does not listen to a conversation to the end or, conversely, it takes too long to wait until it ends?What should I do if SpeechKit does not listen to a conversation to the end or, conversely, it takes too long to wait until it ends?

Error: OutOfRange desc = Exceeded maximum allowed stream durationError: OutOfRange desc = Exceeded maximum allowed stream duration

What goes into the usage cost?What goes into the usage cost?

Speech synthesis TTSSpeech synthesis TTS

How can I voice long texts?How can I voice long texts?

How do I configure stress and pronunciation?How do I configure stress and pronunciation?

How do I add a pause in text?How do I add a pause in text?

A cURL request does not work in Windows PowerShellA cURL request does not work in Windows PowerShell

What goes into the cost of synthesis?What goes into the cost of synthesis?

Was the article helpful?