© 2025 Direct Cursus Technology L.L.C.

In this article:

  • Auto-tuning based on logged data
  • Using audio to improve quality
  • Fine-tuning
  • Text template
  • Glossary
  • Restrictions and requirements for text templates and glossaries

Extending a speech recognition model

Written by
Yandex Cloud
Updated on April 24, 2025

SpeechKit provides multiple ways to improve speech recognition:

  • Auto-tuning
  • Fine-tuning (model tuning)
  • Using audio to improve quality

Auto-tuning based on logged data

By default, SpeechKit does not save data provided by users. However, the most effective way to improve a speech recognition model is to train it on real user data.

To improve speech recognition quality, use model auto-tuning. It saves the data transmitted in your requests and uses it for further training. Auto-tuning improves recognition quality while the model is running, with no additional data collection required on your part.

Auto-tuning is a good option under these conditions:

  • The model fails to recognize some of the vocabulary used in your current work scenario.
  • The vocabulary is easy to perceive by ear and to transcribe. Drug names, for example, are not suitable because they come from a narrow domain. To recognize domain-specific terms, use model tuning.

To send data for auto-tuning, provide the x-data-logging-enabled: true header in your API requests. For an example with logging enabled, see Request headers for troubleshooting in Yandex SpeechKit. Then contact support to request model fine-tuning on the provided data.
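As a rough illustration, the header can be added to the set of HTTP headers sent with each recognition request. The sketch below only builds the header dictionary and sends nothing. The `build_headers` helper and the `Api-Key` authorization scheme are assumptions made for this example, so adapt them to the authentication method you actually use.

```python
def build_headers(api_key: str, enable_logging: bool = True) -> dict[str, str]:
    """Build HTTP headers for a recognition request (hypothetical helper)."""
    # Assumption: the request is authorized with an API key.
    headers = {"Authorization": f"Api-Key {api_key}"}
    if enable_logging:
        # Header named in the text above: allows the request data to be saved.
        headers["x-data-logging-enabled"] = "true"
    return headers


if __name__ == "__main__":
    print(build_headers("<your-api-key>"))
```

Keeping the flag as a function parameter makes it easy to turn logging off for requests that must not be stored.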

Auto-tuning tips:

  • At least 10 hours of audio in Russian is needed to make a noticeable difference in recognition quality. Recognition models for other languages may require more data. The recommended data volume is 50 hours or more.
  • Training a recognition model for Russian takes about three months. During this time, the team checks and validates the data, adds it to the training dataset, and trains the model. For other languages, contact your account manager.

Using audio to improve quality

You can improve speech recognition by submitting your own audio to the SpeechKit team. This method is similar to auto-tuning, but it uses audio you prepare yourself instead of data sent in API requests. Submit the audio to the support team as a ZIP archive. You can also attach transcripts of the recordings, but this is optional.

For the recommended data volume, refer to the auto-tuning tips.
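One possible way to prepare such an archive is sketched below using Python's standard `zipfile` module. The `pack_audio` helper, the list of file extensions, and the idea of storing transcripts as `.txt` files next to the audio are assumptions for illustration only. Confirm the expected archive layout with the support team before submitting.

```python
import zipfile
from pathlib import Path

# Assumption: which files count as audio and transcripts in this example.
AUDIO_EXTENSIONS = (".wav", ".ogg", ".mp3")
TRANSCRIPT_EXTENSIONS = (".txt",)


def pack_audio(audio_dir: Path, archive_path: Path) -> list[str]:
    """Pack audio files and optional transcripts from audio_dir into a ZIP archive."""
    allowed = AUDIO_EXTENSIONS + TRANSCRIPT_EXTENSIONS
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(audio_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in allowed:
                # Store paths relative to audio_dir so the archive has no absolute paths.
                name = path.relative_to(audio_dir).as_posix()
                archive.write(path, arcname=name)
                added.append(name)
    return added
```

The function returns the names it added, which makes it easy to check that nothing unexpected ended up in the archive.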

Fine-tuning

The basic speech recognition model is designed to work with everyday language, but it may not be sufficient to recognize specific vocabulary. With fine-tuning, you can train the model to recognize domain-specific terms from different fields:

  • Medicine: Diagnoses, biological terms, drug names.
  • Business: Company names.
  • Trade: Product ranges (jewelry, electronics, and so on).
  • Finance: Banking terms and names of banking products.

Model tuning relies on text templates and glossaries.

Text template

A text template contains similar phrases with variables the model uses to generate utterances. The variables are replaced with glossary terms.

To improve overall recognition quality, you can use templates together with auto-tuning. They work even for domain-specific vocabulary, such as drug names.

Glossary

A glossary is a full list of terms used in place of variables in templates. It contains words found in test audio files, as well as other vocabulary.

A glossary must be provided as a separate file. One such file is created for each template variable. The variable serves as the name for the relevant glossary file. Each term occupies a separate line in the file.

For example, the first-name.tsv, middle-name.tsv, and last-name.tsv glossary files used for tuning a call center model may contain the first, middle, and last names of customers.

first-name.tsv    middle-name.tsv    last-name.tsv
John              Wendell            Thompson
Tom               Sean               Carter
Peter             Larry              Smith
...               ...                ...
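The file layout described above (one file per template variable, one term per line) could be produced with a small script like the following. This is a minimal sketch: the `write_glossary` helper is hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def write_glossary(directory: Path, variable: str, terms: list[str]) -> Path:
    """Write a glossary file named after the template variable, one term per line."""
    path = directory / f"{variable}.tsv"
    # Drop empty entries and surrounding whitespace so each line holds exactly one term.
    cleaned = [term.strip() for term in terms if term.strip()]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path
```

For example, `write_glossary(Path("."), "first-name", ["John", "Tom", "Peter"])` creates `first-name.tsv` in the current directory.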

If the template phrases require glossary terms in the possessive case, create a separate glossary file for each form. For example, files with names in the possessive case will contain the following entries:

first-name-ablative.tsv    middle-name-ablative.tsv    last-name-ablative.tsv
John                       Wendell                     Thompson's
Tom                        Sean                        Carter's
Peter                      Larry                       Smith's
...                        ...                         ...

Then, the templates.tsv file may contain entries in the following format:

Hello, are you {first-name=first-name.tsv} {middle-name=middle-name.tsv} {last-name=last-name.tsv}?
Hello, can I talk to {first-name=first-name-ablative.tsv} {middle-name=middle-name-ablative.tsv}?
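To see how variables and glossaries combine into utterances, it can help to expand a template locally. The sketch below is only an illustration of the substitution idea and is not how SpeechKit generates utterances internally. The `expand_template` function is hypothetical, and the in-memory `glossaries` dictionary stands in for the glossary files.

```python
import itertools
import re

# Matches placeholders of the form {variable=glossary-file.tsv}.
PLACEHOLDER = re.compile(r"\{([^=}]+)=([^}]+)\}")


def expand_template(template: str, glossaries: dict[str, list[str]]) -> list[str]:
    """Return every phrase produced by substituting glossary terms into the template."""
    files = [match.group(2) for match in PLACEHOLDER.finditer(template)]
    term_lists = [glossaries[name] for name in files]
    phrases = []
    for combination in itertools.product(*term_lists):
        terms = iter(combination)
        # Replace placeholders left to right with the terms of this combination.
        phrases.append(PLACEHOLDER.sub(lambda _match: next(terms), template))
    return phrases


if __name__ == "__main__":
    glossaries = {
        "first-name.tsv": ["John", "Tom"],
        "last-name.tsv": ["Smith"],
    }
    template = "Hello, are you {first-name=first-name.tsv} {last-name=last-name.tsv}?"
    for phrase in expand_template(template, glossaries):
        print(phrase)
```

Expanding templates this way before submission is a quick check that every placeholder points to an existing glossary and that the resulting phrases read naturally.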

Restrictions and requirements for text templates and glossaries

Glossaries and text templates must be provided as TSV files with normalized text:

  • Numerals: Written as words.
  • Latin words and characters: Transcribed.
  • Abbreviations: Spelled out.
  • Foreign language acronyms: Spelled out or transcribed.
  • Russian acronyms: Left unchanged.

Incorrect: We're giving away, i.e. for free, 2 kilos of potatoes, a DNA helix model, and some ABC magazines from 2020.

Correct: We are giving away, that is for free, two kilos of potatoes, a dee-en-ay helix model, and some ay-bee-cee magazines from twenty-twenty.

Requirements for text templates and glossaries:

  • The length of a template including variables must not exceed 200 characters.
  • Only Russian is supported.
  • Tuning works better if there are at least 100 phrases for each glossary term.

Tuning takes about two months from when you submit the data archive to the support team.
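Some of these rules can be checked automatically before you send the archive. The following sketch flags template lines that are too long or that still contain digits or Latin characters outside placeholders. The `check_template` function is hypothetical. It assumes the 200-character limit applies to the raw template line with its placeholders, and it does not cover rules that need human judgment, such as spelling out abbreviations.

```python
import re

MAX_TEMPLATE_LENGTH = 200
# Placeholders such as {first-name=first-name.tsv} are excluded from the text checks.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def check_template(line: str) -> list[str]:
    """Return a list of problems found in one template line (empty if none)."""
    problems = []
    if len(line) > MAX_TEMPLATE_LENGTH:
        problems.append(f"template is {len(line)} characters long, limit is {MAX_TEMPLATE_LENGTH}")
    text = PLACEHOLDER.sub(" ", line)
    if re.search(r"\d", text):
        problems.append("digits found: write numerals as words")
    if re.search(r"[A-Za-z]", text):
        problems.append("Latin characters found: transcribe them")
    return problems
```

Running the check over every line of `templates.tsv` gives a quick list of phrases to fix by hand.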
