Yandex SpeechKit technology overview

Updated on November 20, 2024

Yandex SpeechKit voice technologies cover speech recognition and speech synthesis. SpeechKit can recognize speech in real time or from pre-recorded audio files, automatically detecting the speaker's language. It can also convert pattern-based phrases and long texts to speech using SpeechKit standard voices.

SpeechKit is accessed through its API. Depending on the task, you can use the gRPC or REST interface. For more information about API implementations in Yandex Cloud, see Yandex Cloud API concepts.
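
To make the REST option more concrete, here is a minimal Python sketch that builds, but does not send, a request for recognizing a short audio clip. The endpoint URL, the `folderId` and `lang` query parameters, and the `Bearer` IAM token header are assumptions about the SpeechKit v1 REST API that this overview does not confirm; `build_recognition_request` is a hypothetical helper name. Check all of them against the current API reference before use.

```python
import urllib.parse
import urllib.request

# Assumed endpoint of the synchronous recognition REST API;
# verify it against the current SpeechKit API reference.
STT_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"


def build_recognition_request(
    audio: bytes,
    folder_id: str,
    iam_token: str,
    lang: str = "en-US",
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes.

    The parameter names folderId and lang, and the Bearer authorization
    scheme, are assumptions and are not taken from this overview.
    """
    query = urllib.parse.urlencode({"folderId": folder_id, "lang": lang})
    return urllib.request.Request(
        url=f"{STT_URL}?{query}",
        data=audio,
        method="POST",
        headers={"Authorization": f"Bearer {iam_token}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. Keeping construction separate from sending makes it possible to inspect the URL and headers without network access or credentials.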

The table below lists the most common SpeechKit use cases to help you choose the appropriate technologies and configure them for your needs.

Use case	Description	Recommended technologies	Features and settings
Voice robot	Full or partial automation of telephone communications with customers.	For user input: Streaming recognition. For the system's response: Speech synthesis using standard voices and Brand Voices created specially for you.	You can get both intermediate and final recognition results; Controlling pronunciation with synthesized text markup; Pattern-based speech synthesis
Voice analysis; Operator performance quality control	Transcribing and further analysis of audio recordings of dialogs between customers and call center operators or robots.	To recognize pre-recorded audio files: Asynchronous recognition of audio files.	Timestamps of the start and end of a word in the recognition results; Recognition result normalization; Deferred mode for asynchronous recognition of audio files; Quotas and limits in SpeechKit
Voice control in apps and smart devices; Voice assistant	The user requests an action or search using voice and the service responds with an action with a voice comment or an image.	For user input: Streaming recognition. For the system's response: Speech synthesis using standard voices and Brand Voices.	You can get both intermediate and final recognition results; Controlling pronunciation with synthesized text markup; Recognition result normalization
Service adaptation to people with visual impairments	Voice control, voice hints and comments for visually impaired users.	For user input: Streaming recognition. For the system's response: Speech synthesis using standard voices and Brand Voices.	You can get both intermediate and final recognition results; Controlling pronunciation with synthesized text markup
Recognizing audio recordings made during a meeting	Transcribing the audio recordings after the meeting is completed.	To recognize pre-recorded audio files: Asynchronous recognition of audio files.	Deferred mode for asynchronous recognition of audio files; Quotas and limits in SpeechKit; Timestamps of the start and end of a word in the recognition results; Recognition result normalization
Voicing books and videos	Voicing a book or video with no human speaker involved.	Speech synthesis using standard voices and Brand Voices.	Controlling pronunciation with synthesized text markup; Quotas and limits in SpeechKit
Recording the minutes of a meeting	Transcribing the meeting minutes in real time.	To recognize the participants' speech: Streaming recognition.	You can get both intermediate and final recognition results; Recognition result normalization
Video subtitles	Creating subtitles for recorded videos.	To recognize an audio track: Asynchronous recognition of audio files.	Deferred mode for asynchronous recognition of audio files; Timestamps of the start and end of a word in the recognition results; Recognition result normalization; Quotas and limits in SpeechKit
Broadcast subtitles	Transcribing broadcasts in real time.	To recognize the broadcast speech: Streaming recognition.	You can get both intermediate and final recognition results; Recognition result normalization
Transcribing voice messages	Converting short voice messages to text in messengers.	To recognize audio files: Synchronous recognition.	Recognition result settings
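
The recognition recommendations in the table follow a simple pattern: real-time input calls for streaming recognition, short pre-recorded clips such as voice messages call for synchronous recognition, and other pre-recorded audio calls for asynchronous recognition. The Python sketch below encodes that pattern as a rule of thumb. The `Recognition` enum and the `recommend_recognition` function are hypothetical names introduced for illustration and are not part of SpeechKit.

```python
from enum import Enum


class Recognition(Enum):
    """Recognition technologies named in the use-case table."""

    STREAMING = "Streaming recognition"
    ASYNCHRONOUS = "Asynchronous recognition of audio files"
    SYNCHRONOUS = "Synchronous recognition"


def recommend_recognition(real_time: bool, short_audio: bool = False) -> Recognition:
    """Pick a recognition technology following the use-case table.

    real_time: speech must be transcribed as it is spoken (voice robots,
        voice assistants, live meeting minutes, broadcast subtitles).
    short_audio: the input is a short pre-recorded clip such as a voice
        message; ignored when real_time is True.
    """
    if real_time:
        return Recognition.STREAMING
    if short_audio:
        return Recognition.SYNCHRONOUS
    return Recognition.ASYNCHRONOUS
```

For example, `recommend_recognition(real_time=False)` corresponds to the meeting-recording and video-subtitle rows, while `recommend_recognition(real_time=False, short_audio=True)` corresponds to the voice-message row. The helper deliberately ignores quotas and limits, which the table lists as a separate consideration.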
