Extending a speech recognition model

Updated on June 20, 2025

SpeechKit provides three ways to improve speech recognition: auto-tuning based on logged data, using audio to improve quality, and fine-tuning.

Auto-tuning based on logged data

By default, SpeechKit does not save data provided by users. However, the most effective way to improve a speech recognition model is to train it on real user data.

To improve the quality of speech recognition, use model auto-tuning. It lets you save the data transmitted in requests and use it for further training. Auto-tuning improves recognition quality while the model is running, without any further data collection on your part.

Auto-tuning is a good option under these conditions:

The current model fails to recognize some of the vocabulary used in your work tasks.
The vocabulary for auto-tuning is easy to perceive by ear and to transcribe. For example, drug names are not suitable because these terms come from a narrow domain. To recognize domain-specific terms, use fine-tuning instead.

To send data for auto-tuning, provide the x-data-logging-enabled: true header in your API requests. For an example with logging enabled, see Request headers for troubleshooting in Yandex SpeechKit. Then contact support to request model fine-tuning on the provided data.
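As a minimal sketch of the step above, the following Python code builds a recognition request with the logging header set. Only the x-data-logging-enabled: true header comes from this article. The endpoint URL, the Api-Key authorization scheme, and the function names build_headers and build_request are assumptions for illustration, so check them against the SpeechKit API reference before use.

```python
import urllib.request

# Assumed endpoint of the synchronous recognition REST API; verify in the API reference.
STT_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"


def build_headers(api_key: str, enable_logging: bool = True) -> dict:
    """Return request headers, optionally opting the request in to data logging."""
    # Assumed authorization scheme (API key of a service account).
    headers = {"Authorization": f"Api-Key {api_key}"}
    if enable_logging:
        # Header named in this article: allows the request data to be saved.
        headers["x-data-logging-enabled"] = "true"
    return headers


def build_request(audio: bytes, api_key: str, enable_logging: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries the audio bytes."""
    return urllib.request.Request(
        STT_URL,
        data=audio,
        headers=build_headers(api_key, enable_logging),
        method="POST",
    )
```

To actually send the request, pass the result of build_request to urllib.request.urlopen. The sketch keeps building and sending separate so that the headers can be inspected before any data leaves your system.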

Auto-tuning tips:

At least 10 hours of audio in Russian are needed to make a difference in recognition quality. Recognition models for other languages may require more data. The recommended data volume is 50 hours or more.
Recognition model training takes about three months for Russian. During this time, the team will check and validate the data, add it to the training dataset, and train the model. For other languages, contact your account manager.

Using audio to improve quality

You can improve speech recognition by submitting audio files to the SpeechKit team. This method is similar to auto-tuning but uses audio that you prepare yourself instead of data provided in API requests. Submit the audio to the support team as a ZIP archive. You can also attach transcripts of the recordings, but this is optional.
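The archive can be assembled with a short script. The sketch below packs every audio file from a folder into a ZIP archive and adds a transcript when a text file with the same base name exists. The article does not prescribe an archive layout, so the flat structure, the .wav and .txt extensions, and the function name pack_audio are assumptions. Confirm the expected layout with the support team.

```python
import zipfile
from pathlib import Path


def pack_audio(src_dir: Path, archive_path: Path, audio_ext: str = ".wav") -> list:
    """Zip each audio file in src_dir, plus its same-named .txt transcript if one exists.

    Returns the names of the files written to the archive, in order.
    """
    packed = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for audio in sorted(src_dir.glob(f"*{audio_ext}")):
            zf.write(audio, arcname=audio.name)
            packed.append(audio.name)
            # Transcripts are optional, so a missing .txt file is not an error.
            transcript = audio.with_suffix(".txt")
            if transcript.exists():
                zf.write(transcript, arcname=transcript.name)
                packed.append(transcript.name)
    return packed
```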

For the recommended data size, see the auto-tuning tips above.
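Before submitting, you can check how much audio you have collected. The sketch below sums the duration of uncompressed WAV files and compares the total with the 10-hour minimum and the 50-hour recommendation given for Russian in the auto-tuning tips. The WAV format and the names wav_seconds, total_hours, and volume_status are assumptions for illustration. Other audio formats need a different duration reader.

```python
import wave
from pathlib import Path

MIN_HOURS = 10.0          # minimum stated for Russian audio
RECOMMENDED_HOURS = 50.0  # recommended data volume


def wav_seconds(path: Path) -> float:
    """Duration of one WAV file in seconds: frame count divided by frame rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def total_hours(paths) -> float:
    """Total duration of all given WAV files, in hours."""
    return sum(wav_seconds(p) for p in paths) / 3600.0


def volume_status(hours: float) -> str:
    """Classify the collected volume against the two thresholds."""
    if hours < MIN_HOURS:
        return "below minimum"
    if hours < RECOMMENDED_HOURS:
        return "above minimum, below recommended"
    return "recommended volume reached"
```

For example, volume_status(total_hours(Path("audio").glob("*.wav"))) reports where a folder of recordings stands, where the folder name audio is a placeholder.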

Fine-tuning

The basic speech recognition model is designed to work with everyday language, but it may not be sufficient to recognize specific vocabulary. By tuning, you can train the model to recognize domain-specific terms from different fields:

Medicine: Diagnoses, biological terms, drug names.
Business: Company names.
Trade: Product ranges (jewelry, electronics, and so on).
Finance: Banking terms and names of banking products.

For tuning, you will need a list of terms (words or phrases) and at least three free-form text examples for each term.
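A quick check can catch terms that do not yet have enough examples. The sketch below assumes the data is held as a mapping from each term to its list of example texts. This in-memory structure and the function name find_underfilled_terms are hypothetical, and the article does not define a file format for the submission.

```python
MIN_EXAMPLES = 3  # at least three free-form text examples per term


def find_underfilled_terms(examples_by_term: dict) -> dict:
    """Return {term: usable_example_count} for terms with fewer than MIN_EXAMPLES examples.

    Blank strings and exact duplicates are not counted as separate examples.
    """
    underfilled = {}
    for term, examples in examples_by_term.items():
        usable = {text.strip() for text in examples if text.strip()}
        if len(usable) < MIN_EXAMPLES:
            underfilled[term] = len(usable)
    return underfilled
```

An empty result means every term meets the three-example requirement. Treating duplicates as a single example is a design choice of this sketch, not a stated rule.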

Tuning is available for the Russian language only.

Tuning takes about two months from when you submit the data archive to the support team.
