Extending a speech recognition model
SpeechKit provides multiple ways to improve speech recognition: auto-tuning, fine-tuning, and submitting your own audio.
Auto-tuning
By default, SpeechKit does not save data provided by users. However, the most effective way to improve a speech recognition model is to train it on real user data.
To improve the quality of speech recognition, use model auto-tuning. It will enable you to save data transmitted in requests and use it for further training. Auto-tuning improves recognition quality while the model is running without any further actions on your part.
Auto-tuning is a good option under these conditions:
- The current script used for work tasks fails to recognize some of the vocabulary.
- The vocabulary for auto-tuning should be easy to perceive by ear and to transcribe. For example, drug names are not suitable as these terms come from a narrow domain. To recognize domain-specific terms, use model tuning.
To send data for auto-tuning, enable logging. To do this, provide the x-data-logging-enabled: true header in your API requests. For an example with logging enabled, see Troubleshooting in Yandex SpeechKit.
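As a minimal sketch of how the logging header could be attached to a request, the Python snippet below builds the headers and prepares a request without sending it. The endpoint URL, the Api-Key authorization scheme, and the helper names build_headers and build_request are assumptions for illustration only; take the actual values from the SpeechKit API reference.

```python
import urllib.request

# Assumption: this endpoint and the auth scheme are illustrative placeholders;
# check the SpeechKit API reference for the values that apply to your setup.
STT_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"


def build_headers(api_key: str, enable_logging: bool = True) -> dict[str, str]:
    """Return request headers, adding the logging header when requested."""
    headers = {"Authorization": f"Api-Key {api_key}"}
    if enable_logging:
        # Header named in the text above; it opts this request into data logging.
        headers["x-data-logging-enabled"] = "true"
    return headers


def build_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a recognition request with logging enabled."""
    return urllib.request.Request(
        STT_URL, data=audio, headers=build_headers(api_key), method="POST"
    )
```

Keeping the header in one helper makes it easy to switch logging off for requests whose data should not be saved.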
Restrictions and requirements:
- A minimum of 10 hours of audio is required. Recommended number of hours: 50 or more.
- Training the recognition model takes about three months for Russian. For other languages, contact your account manager.
Fine-tuning
The basic speech recognition model is designed to work with everyday language, but it may not be sufficient to recognize specific vocabulary. By tuning, you can train the model to recognize domain-specific terms from different fields:
- Medicine: Diagnoses, biological terms, drug names.
- Business: Company names.
- Trade: Product ranges (jewelry, electronics, and so on).
- Finance: Banking terms and names of banking products.
Model tuning relies on text templates and glossaries.
Text template
A text template contains similar phrases with variables the model uses to generate utterances. The variables are replaced with glossary terms.
To improve overall recognition quality, you can use templates together with auto-tuning. They work even for domain-specific vocabulary, such as drug names.
Glossary
A glossary is a full list of terms used in place of variables in templates. It contains words found in test audio files, as well as other vocabulary.
A glossary must be provided as a separate file. One such file is created for each template variable. The variable serves as the name for the relevant glossary file. Each term occupies a separate line in the file.
For example, the first-name.tsv, middle-name.tsv, and last-name.tsv glossary files used for tuning a call center model may contain the first, middle, and last names of customers.
first-name.tsv | middle-name.tsv | last-name.tsv |
---|---|---|
John | Wendell | Thompson |
Tom | Sean | Carter |
Peter | Larry | Smith |
... | ... | ... |
If the template phrases use glossary terms in other grammatical forms, such as the possessive case, create a separate glossary file for each form. For example, files with names in the possessive case contain the following entries:
first-name-ablative.tsv | middle-name-ablative.tsv | last-name-ablative.tsv |
---|---|---|
John | Wendell | Thompson's |
Tom | Sean | Carter's |
Peter | Larry | Smith's |
... | ... | ... |
Then, the templates.tsv file may contain entries in the following format:
Hello, are you {first-name=first-name.tsv} {middle-name=middle-name.tsv} {last-name=last-name.tsv}?
Hello, can I talk to {first-name=first-name-ablative.tsv} {middle-name=middle-name-ablative.tsv}?
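To illustrate how variables in a template are replaced with glossary terms, here is a minimal Python sketch that expands one template line into every possible phrase. It is not how SpeechKit generates utterances internally; the expand_template function, the placeholder regular expression, and the in-memory glossaries mapping are assumptions made for this example.

```python
import re
from itertools import product

# Matches placeholders such as {first-name=first-name.tsv}.
PLACEHOLDER = re.compile(r"\{([^{}=]+)=([^{}]+)\}")


def expand_template(template: str, glossaries: dict[str, list[str]]) -> list[str]:
    """Return every phrase obtained by substituting glossary terms for variables.

    `glossaries` maps a glossary file name to its list of terms.
    """
    # Glossary file referenced by each placeholder, in order of appearance.
    files = [m.group(2) for m in PLACEHOLDER.finditer(template)]
    phrases = []
    for combo in product(*(glossaries[f] for f in files)):
        terms = iter(combo)
        # Replace placeholders left to right with the terms of this combination.
        phrases.append(PLACEHOLDER.sub(lambda _m: next(terms), template))
    return phrases


glossaries = {
    "first-name.tsv": ["John", "Tom"],
    "last-name.tsv": ["Thompson"],
}
phrases = expand_template(
    "Hello, are you {first-name=first-name.tsv} {last-name=last-name.tsv}?",
    glossaries,
)
# phrases == ["Hello, are you John Thompson?", "Hello, are you Tom Thompson?"]
```

The number of generated phrases is the product of the glossary sizes, so a preview like this can help estimate how many utterances a template yields before you submit it.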
Restrictions and requirements for text templates and glossaries
Glossaries and text templates must be provided as TSV files containing normalized text:
- Numerals: Written as words.
- Latin words and characters: Transcribed.
- Abbreviations: Spelled out.
- Foreign language acronyms: Spelled out or transcribed.
- Russian acronyms: Left unchanged.
Original text: We're giving away, i.e. for free, 2 kilos of potatoes, a DNA helix model, and some ABC magazines from 2020.
Normalized text: We are giving away, that is for free, two kilos of potatoes, a dee-en-ay helix model, and some ay-bee-cee magazines from twenty-twenty.
Requirements for text templates and glossaries:
- The length of a template including variables must not exceed 200 characters.
- Only Russian is supported.
- Tuning works better if there are at least 100 phrases for each glossary term.
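Some of the requirements above can be checked automatically before you submit the files. The Python sketch below is a hypothetical pre-submission check, not an official SpeechKit tool: it assumes Russian-language templates, measures the length of the raw template line (placeholders included), and flags digits and Latin characters outside placeholders as likely normalization problems. The check_template name and the messages are made up for this example.

```python
import re

MAX_TEMPLATE_LENGTH = 200  # limit stated in the requirements above
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def check_template(line: str) -> list[str]:
    """Return a list of problems found in one template line (empty if none)."""
    problems = []
    if len(line) > MAX_TEMPLATE_LENGTH:
        problems.append(f"too long: {len(line)} characters")
    # Ignore placeholders: variable and file names may contain Latin letters.
    text = PLACEHOLDER.sub(" ", line)
    if re.search(r"\d", text):
        problems.append("contains digits; write numerals as words")
    if re.search(r"[A-Za-z]", text):
        problems.append("contains Latin characters; transcribe them")
    return problems
```

A check like this cannot confirm that abbreviations are spelled out or that acronyms are transcribed correctly, so it complements a manual review rather than replacing it.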
Tuning takes about two months from when you submit the data archive to the support team.
Using audio to improve quality
You can improve speech recognition by submitting audio files to the SpeechKit team. This method is similar to auto-tuning but uses audio that you prepare instead of data provided in API requests. Submit the audio to the support team.
This method shares the same restrictions as auto-tuning:
- A minimum of 10 hours of audio is required. Recommended number of hours: 50 or more.
- Training the recognition model takes about three months for Russian. For other languages, contact your account manager.
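Before submitting audio, it may help to confirm that the data set meets the minimum and recommended durations listed above. The following Python sketch sums the duration of a set of files and classifies the total. It assumes the audio is stored as PCM WAV files readable by the standard wave module, which the requirements above do not specify; the function names and status labels are illustrative.

```python
import wave

MIN_HOURS = 10          # minimum stated above
RECOMMENDED_HOURS = 50  # recommended amount stated above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of one WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_hours(hours: float) -> str:
    """Compare a total duration in hours with the stated thresholds."""
    if hours < MIN_HOURS:
        return "insufficient"
    if hours < RECOMMENDED_HOURS:
        return "minimum met"
    return "recommended"


def dataset_status(paths: list[str]) -> tuple[float, str]:
    """Return the total duration in hours and its classification."""
    hours = sum(wav_duration_seconds(p) for p in paths) / 3600
    return hours, classify_hours(hours)
```

For example, dataset_status(["call-001.wav", "call-002.wav"]) returns the combined length of the two hypothetical files in hours together with one of the three labels.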