

Speech recognition

Written by Yandex Cloud. Updated on April 30, 2025.
  • Recognition methods
    • Which recognition to choose
  • Recognition process
  • Recognition accuracy
  • Use cases

Speech recognition is speech-to-text (STT) conversion.

To work with SpeechKit, you can access it via the API or Playground. For more information about working with the Yandex Cloud API, see API concepts.

The service is available at stt.api.cloud.yandex.net:443.

You can also work with SpeechKit through the Python SDK, which is built on top of the SpeechKit API v3.

To try out the Text-to-Speech and Speech-to-Text product demos, visit the SpeechKit page on our website.
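As a minimal illustration of API access, here is a sketch of how a synchronous recognition request could be assembled in Python. The endpoint path and the `lang`/`format` parameter names follow the REST API v1 reference; the helper name `build_stt_request` and the API key are placeholders of our own, not part of SpeechKit.

```python
import urllib.parse

# Hypothetical helper: the endpoint and parameter names follow the
# synchronous recognition REST API v1; the API key is a placeholder.
STT_V1_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"

def build_stt_request(api_key: str, lang: str = "en-US", audio_format: str = "oggopus"):
    """Assemble the URL and headers for a synchronous recognition request."""
    params = {"lang": lang, "format": audio_format}
    url = STT_V1_URL + "?" + urllib.parse.urlencode(params)
    headers = {"Authorization": f"Api-Key {api_key}"}
    return url, headers

# To recognize speech, POST the raw audio bytes to the URL, for example:
#   url, headers = build_stt_request("<your-api-key>")
#   requests.post(url, data=open("speech.ogg", "rb").read(), headers=headers)
```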

Recognition methods

SpeechKit provides two speech recognition methods:

  1. Streaming recognition is used for real-time speech recognition. During streaming recognition, SpeechKit receives short audio fragments and sends the results, including intermediate ones, over a single connection.
  2. Audio file recognition. SpeechKit can recognize audio recordings in synchronous and asynchronous modes.
    • Synchronous mode has strict limitations on file size and duration and is suitable for single-channel audio fragments of up to 30 seconds.
    • Asynchronous mode can process multi-channel audio fragments. Maximum recording duration: 4 hours.

Which recognition to choose

Streaming recognition
  • Use cases: phone assistants and robots, virtual assistants.
  • Input data: real-time voice.
  • How it works: exchanging messages with the server over a single connection.
  • Supported APIs: gRPC v2, gRPC v3.
  • Maximum duration of audio data: 5 minutes.
  • Maximum amount of transmitted data: 10 MB.
  • Number of recognition channels: 1.

Synchronous recognition
  • Use cases: virtual assistants, voice control, speech recognition of short voice messages in messengers.
  • Input data: pre-recorded short single-channel audio files.
  • How it works: request — quick response.
  • Supported APIs: REST v1.
  • Maximum duration of audio data: 30 seconds.
  • Maximum amount of transmitted data: 1 MB.
  • Number of recognition channels: 1.

Asynchronous recognition
  • Use cases: transcribing audio calls and presentations, subtitling, call center analytics (script compliance monitoring, identifying successful scripts, evaluating agent performance).
  • Input data: pre-recorded multi-channel and long audio files.
  • How it works: request — delayed response.
  • Supported APIs: REST v2, gRPC v3.
  • Maximum duration of audio data: 4 hours.
  • Maximum amount of transmitted data: 1 GB.
  • Number of recognition channels: 2.
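The limits above translate into a simple selection rule. The following sketch encodes that rule; the function name and signature are illustrative, not part of the SpeechKit API.

```python
def choose_recognition_mode(realtime: bool, duration_s: float, channels: int) -> str:
    """Pick a SpeechKit recognition mode using the documented limits.

    realtime   -- audio arrives live (e.g. a phone call) rather than as a file
    duration_s -- total audio duration in seconds
    channels   -- number of audio channels
    """
    if realtime:
        if channels != 1 or duration_s > 5 * 60:
            raise ValueError("streaming recognition: 1 channel, up to 5 minutes")
        return "streaming"
    # Pre-recorded files: synchronous mode only fits short single-channel audio.
    if channels == 1 and duration_s <= 30:
        return "synchronous"
    if duration_s <= 4 * 60 * 60:
        return "asynchronous"
    raise ValueError("recording exceeds the 4-hour limit of asynchronous mode")
```

For example, a 20-second single-channel voice message maps to synchronous mode, while a one-hour two-channel call recording maps to asynchronous mode.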

Recognition process

Audio is recognized in three stages:

  1. The acoustic model determines which set of low-level attributes corresponds to the audio signal.
  2. The language model uses the acoustic model output to assemble the recognized text word by word.
  3. The service performs text processing: punctuation, converting numerals into numbers, and more.
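To make the third stage concrete, here is a toy sketch of text processing that converts spelled-out numerals into digits. SpeechKit's real normalization is far more involved (punctuation, compound numbers, and so on); this only handles single-word numerals and is purely illustrative.

```python
# Toy numeral-to-digit map for the illustration; not SpeechKit's actual rules.
NUMERALS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

def normalize(text: str) -> str:
    """Replace each spelled-out numeral with its digit form."""
    return " ".join(NUMERALS.get(word, word) for word in text.split())

# normalize("call me at ten") -> "call me at 10"
```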

Recognition accuracy

Recognition accuracy depends on the recognition model. You can improve a model's accuracy by providing tuning data. For more information, see Extending a speech recognition model.

The accuracy of speech recognition is also affected by:

  • Original sound quality.
  • Audio encoding quality.
  • Speech intelligibility and rate.
  • Utterance complexity and length.

Use cases

  • Audio file streaming recognition using the API v3
  • Streaming speech recognition with auto language detection in the API v3
  • Asynchronous WAV audio file recognition using the API v3
  • Example of using the API v1 for synchronous recognition

See also

  • Supported audio formats
  • Supported languages and recognition models
  • Streaming speech recognition
  • Detecting the end of utterance
  • Recognition result normalization
  • Synchronous audio recognition
  • Asynchronous recognition
  • Extending a speech recognition model

© 2025 Direct Cursus Technology L.L.C.