© 2025 Direct Cursus Technology L.L.C.

In this article:

  • Calls Recognizer
  • RecognizeStreaming
  • StreamingRequest
  • StreamingOptions
  • RecognitionModelOptions
  • AudioFormatOptions
  • RawAudio
  • ContainerAudio
  • TextNormalizationOptions
  • LanguageRestrictionOptions
  • EouClassifierOptions
  • DefaultEouClassifier
  • ExternalEouClassifier
  • RecognitionClassifierOptions
  • RecognitionClassifier
  • SpeechAnalysisOptions
  • SpeakerLabelingOptions
  • AudioChunk
  • SilenceChunk
  • Eou
  • StreamingResponse
  • SessionUuid
  • AudioCursors
  • AlternativeUpdate
  • Alternative
  • Word
  • LanguageEstimation
  • EouUpdate
  • FinalRefinement
  • StatusCode
  • RecognitionClassifierUpdate
  • RecognitionClassifierResult
  • PhraseHighlight
  • RecognitionClassifierLabel
  • SpeakerAnalysis
  • AudioSegmentBoundaries
  • DescriptiveStatistics
  • Quantile
  • ConversationAnalysis
  • InterruptsEvaluation

SpeechKit Hybrid Recognition API, gRPC: Recognizer

Written by
Yandex Cloud
Updated at January 25, 2024

A set of methods for voice recognition.

Call Description
RecognizeStreaming Expects audio in real time

Calls Recognizer

RecognizeStreaming

Expects audio in real time.

rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)

StreamingRequest

Field Description
Event oneof: session_options, chunk, silence_chunk or eou
  session_options StreamingOptions
Session options. Must be the first message from the user.
  chunk AudioChunk
Chunk with audio data.
  silence_chunk SilenceChunk
Chunk with silence.
  eou Eou
Request to end the current utterance. Works only with the external EOU detector.
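
The ordering rule above (session options first, then audio data) can be sketched with plain Python dicts standing in for the StreamingRequest messages. The field names follow the table; the dict-based representation is illustrative only, not the generated gRPC classes:

```python
def streaming_requests(audio_chunks, session_options):
    """Yield messages in the order the Recognizer expects:
    session options first, then audio data."""
    # The first message must carry session_options.
    yield {"session_options": session_options}
    # Every following message carries one chunk of audio bytes.
    for chunk in audio_chunks:
        yield {"chunk": {"data": chunk}}

requests = list(streaming_requests(
    audio_chunks=[b"\x00\x01", b"\x02\x03"],
    session_options={"recognition_model": {"model": "general"}},
))
```

With the real protobuf stubs, an iterator like this is what you would pass to the client's bidirectional streaming call.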

StreamingOptions

Field Description
recognition_model RecognitionModelOptions
Configuration for speech recognition model.
eou_classifier EouClassifierOptions
Configuration for end of utterance detection model.
recognition_classifier RecognitionClassifierOptions
Configuration for classifiers over speech recognition.
speech_analysis SpeechAnalysisOptions
Configuration for speech analysis over speech recognition.
speaker_labeling SpeakerLabelingOptions
Configuration for speaker labeling

RecognitionModelOptions

Field Description
model string
Sets the recognition model for the cloud version of SpeechKit. Possible values: 'general', 'general:rc', 'general:deprecated'. The model is ignored for SpeechKit Hybrid.
audio_format AudioFormatOptions
Specifies the input audio format.
text_normalization TextNormalizationOptions
Text normalization options.
language_restriction LanguageRestrictionOptions
Possible languages in audio.
audio_processing_type enum AudioProcessingType
How to deal with audio data (in real time, after all data is received, etc). Default is REAL_TIME.
  • REAL_TIME: Process audio in mode optimized for real-time recognition, i.e. send partials and final responses as soon as possible
  • FULL_DATA: Process audio after all data was received

AudioFormatOptions

Field Description
AudioFormat oneof: raw_audio or container_audio
  raw_audio RawAudio
Audio without a container.
  container_audio ContainerAudio
Audio wrapped in a container.

RawAudio

Field Description
audio_encoding enum AudioEncoding
Type of audio encoding
  • LINEAR16_PCM: Audio bit depth 16-bit signed little-endian (Linear PCM).
sample_rate_hertz int64
PCM sample rate
audio_channel_count int64
PCM channel count. Currently, only single-channel audio is supported in real-time recognition.
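
For LINEAR16_PCM, the byte rate follows directly from these fields: 2 bytes per sample times sample_rate_hertz times audio_channel_count. A small helper (hypothetical, useful for sizing audio chunks):

```python
def pcm_bytes_per_ms(sample_rate_hertz: int, audio_channel_count: int = 1) -> float:
    """Bytes of LINEAR16_PCM audio per millisecond:
    16-bit samples are 2 bytes each."""
    return sample_rate_hertz * 2 * audio_channel_count / 1000

# 100 ms of 16 kHz mono audio:
chunk_size = int(100 * pcm_bytes_per_ms(16000))  # 3200 bytes
```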

ContainerAudio

Field Description
container_audio_type enum ContainerAudioType
Type of audio container.
  • WAV: Audio bit depth 16-bit signed little-endian (Linear PCM).
  • OGG_OPUS: Data is encoded using the OPUS audio codec and compressed using the OGG container format.

TextNormalizationOptions

Field Description
text_normalization enum TextNormalization
Normalization
  • TEXT_NORMALIZATION_ENABLED: Enable normalization
  • TEXT_NORMALIZATION_DISABLED: Disable normalization
profanity_filter bool
Profanity filter (default: false).
literature_text bool
Rewrite text in literature style (default: false).
phone_formatting_mode enum PhoneFormattingMode
Defines the phone formatting mode
  • PHONE_FORMATTING_MODE_DISABLED: Disable phone formatting

LanguageRestrictionOptions

Field Description
restriction_type enum LanguageRestrictionType
Language restriction type
  • WHITELIST: The allowing list. The incoming audio can contain only the listed languages.
  • BLACKLIST: The forbidding list. The incoming audio cannot contain the listed languages.
language_code[] string
The list of language codes to restrict recognition in the case of an auto model

EouClassifierOptions

Field Description
Classifier oneof: default_classifier or external_classifier
Type of EOU classifier.
  default_classifier DefaultEouClassifier
EOU classifier provided by SpeechKit. Default.
  external_classifier ExternalEouClassifier
EOU is enforced by external messages from the user.

DefaultEouClassifier

Field Description
type enum EouSensitivity
EOU sensitivity. Currently there are two levels: a faster one with more errors, and a more conservative one (the default).
max_pause_between_words_hint_ms int64
Hint for the maximum pause between words. The EOU detector can use this information to distinguish between the end of an utterance and slow speech (like counting: one, two, three, etc.).

ExternalEouClassifier

Empty.

RecognitionClassifierOptions

Field Description
classifiers[] RecognitionClassifier
List of classifiers to use

RecognitionClassifier

Field Description
classifier string
Classifier name
triggers[] enum TriggerType
Specifies the response types to which the classification results are attached
  • ON_UTTERANCE: Apply classifier to utterance responses
  • ON_FINAL: Apply classifier to final responses
  • ON_PARTIAL: Apply classifier to partial responses

SpeechAnalysisOptions

Field Description
enable_speaker_analysis bool
Analyse speech for every speaker
enable_conversation_analysis bool
Analyse conversation of two speakers
descriptive_statistics_quantiles[] double
Quantile levels in range (0, 1) for descriptive statistics

SpeakerLabelingOptions

Field Description
speaker_labeling enum SpeakerLabeling
Specifies the execution of speaker labeling. Default is SPEAKER_LABELING_DISABLED.
  • SPEAKER_LABELING_ENABLED: Enable speaker labeling
  • SPEAKER_LABELING_DISABLED: Disable speaker labeling

AudioChunk

Field Description
data bytes
Bytes with audio data.

SilenceChunk

Field Description
duration_ms int64
Duration of silence chunk in ms.
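
A SilenceChunk replaces raw silent audio with just its duration. Compared with streaming the equivalent LINEAR16_PCM zeros, the saving is easy to quantify (a sketch; the 16 kHz mono rate is an assumption for illustration):

```python
def raw_silence_bytes(duration_ms: int, sample_rate_hertz: int = 16000) -> int:
    """Bytes needed to send duration_ms of silence as LINEAR16_PCM zeros (mono)."""
    return duration_ms * sample_rate_hertz * 2 // 1000

# Sending 5 s of silence as raw audio vs. as a single SilenceChunk message:
as_audio = raw_silence_bytes(5000)               # 160000 bytes of zeros
as_message = {"silence_chunk": {"duration_ms": 5000}}  # a few bytes on the wire
```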

Eou

Empty.

StreamingResponse

Field Description
session_uuid SessionUuid
Session identifier
audio_cursors AudioCursors
Progress of the stream recognition session: how much data has been received, final and partial times, etc.
response_wall_time_ms int64
Wall-clock time on the server side: the time when the server wrote the results to the stream.
Event oneof: partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis or conversation_analysis
  partial AlternativeUpdate
Partial results; the server sends them regularly once enough audio data has been received from the user. These are the current text estimates for the interval from final_time_ms to partial_time_ms and may change as new data arrives.
  final AlternativeUpdate
Final results; the recognition is now fixed up to final_time_ms. For now, a final is sent only if the EOU event was triggered. This may change in future releases.
  eou_update EouUpdate
After the EOU classifier triggers, the server sends an EouUpdate with the EOU time. Before the eou_update, a final with the same time is sent; there may be several finals before an EOU update.
  final_refinement FinalRefinement
For each final, if normalization is enabled, the server sends the normalized text (or the result of some other advanced post-processing). Final normalization introduces additional latency.
  status_code StatusCode
Status messages, sent by the server at a fixed interval (keep-alive).
  classifier_update RecognitionClassifierUpdate
Result of the triggered classifier.
  speaker_analysis SpeakerAnalysis
Speech statistics for every speaker.
  conversation_analysis ConversationAnalysis
Conversation statistics.
channel_tag string
Tag for distinguishing audio channels.
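
Each StreamingResponse carries exactly one Event. A client typically dispatches on which field is set; this is sketched here over dict-shaped responses (the protobuf API exposes the same thing via WhichOneof, but dicts keep the sketch self-contained):

```python
EVENT_FIELDS = ("partial", "final", "eou_update", "final_refinement",
                "status_code", "classifier_update", "speaker_analysis",
                "conversation_analysis")

def event_type(response: dict) -> str:
    """Return which Event field is set in a StreamingResponse-like dict."""
    for field in EVENT_FIELDS:
        if field in response:
            return field
    raise ValueError("response carries no known event")

kind = event_type({"session_uuid": {"uuid": "abc"},
                   "final": {"alternatives": [{"text": "hello"}]}})
```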

SessionUuid

Field Description
uuid string
Internal session identifier.
user_request_id string
User session identifier.

AudioCursors

Field Description
received_data_ms int64
Amount of audio data the server has received. This cursor advances after each audio chunk is received by the server.
reset_time_ms int64
Input stream reset time.
partial_time_ms int64
How much audio has been processed, including trimmed silences. This cursor advances after the server has received enough data to update the recognition results.
final_time_ms int64
Time of the last final. This cursor advances when the server decides that recognition from the start of the audio up to final_time_ms will no longer change; usually this event is followed by EOU detection (but this may change in the future).
final_index int64
Index of the last final the server sent. Incremented after each new final.
eou_time_ms int64
Estimated time of EOU. This cursor is updated after each new EOU is sent. For the external classifier, it equals received_data_ms at the moment the EOU event arrives. For the internal classifier, it is a time estimate; the time is not exact and has the same guarantees as word timings.
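
Reading the cursor definitions together, recognition can never be final beyond what has been partially processed, nor partially processed beyond what has been received. A sanity check a client might run on each response (an illustrative helper, not part of the API):

```python
def cursors_consistent(c: dict) -> bool:
    """final_time_ms <= partial_time_ms <= received_data_ms must hold."""
    return (c["final_time_ms"] <= c["partial_time_ms"]
            <= c["received_data_ms"])

ok = cursors_consistent({"received_data_ms": 5000,
                         "partial_time_ms": 4800,
                         "final_time_ms": 3000})
```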

AlternativeUpdate

Field Description
alternatives[] Alternative
List of hypotheses for the time frame.
channel_tag string

Alternative

Field Description
words[] Word
Words in time frame.
text string
Text in time frame.
start_time_ms int64
Start of time frame.
end_time_ms int64
End of time frame.
confidence double
The hypothesis confidence. Currently not used.
languages[] LanguageEstimation
Distribution over possible languages.

Word

Field Description
text string
Word text.
start_time_ms int64
Estimation of word start time in ms.
end_time_ms int64
Estimation of word end time in ms.

LanguageEstimation

Field Description
language_code string
Language code in ISO 639-1 format.
probability double
Estimation of language probability.

EouUpdate

Field Description
time_ms int64
EOU estimated time.

FinalRefinement

Field Description
final_index int64
Index of the final for which the server sends additional information.
Type oneof: normalized_text
Type of refinement.
  normalized_text AlternativeUpdate
Normalized text instead of the raw one.
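
Since refinements arrive separately from the finals they refine, a client can pair them up through final_index. A sketch using dict-shaped messages (the buffering strategy here is an assumption, not prescribed by the API):

```python
finals = {}   # final_index -> raw recognized text
refined = {}  # final_index -> normalized text

def on_final(index: int, update: dict) -> None:
    """Record the top alternative of a final result."""
    finals[index] = update["alternatives"][0]["text"]

def on_final_refinement(msg: dict) -> None:
    """Attach normalized text to the final it refines, via final_index."""
    text = msg["normalized_text"]["alternatives"][0]["text"]
    refined[msg["final_index"]] = text

on_final(0, {"alternatives": [{"text": "twenty five"}]})
on_final_refinement({"final_index": 0,
                     "normalized_text": {"alternatives": [{"text": "25"}]}})
```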

StatusCode

Field Description
code_type enum CodeType
Code type.
  • WORKING: All good.
  • WARNING: For example, if speech is not sent in real time, or the context is unknown and a fallback was applied.
  • CLOSED: Sent after the session was closed.
message string
Human readable message.

RecognitionClassifierUpdate

Field Description
window_type enum WindowType
Response window type
  • LAST_UTTERANCE: The result of applying the classifier to the last utterance response
  • LAST_FINAL: The result of applying the classifier to the last final response
  • LAST_PARTIAL: The result of applying the classifier to the last partial response
start_time_ms int64
Start time of the audio segment used for classification
end_time_ms int64
End time of the audio segment used for classification
classifier_result RecognitionClassifierResult
Result for dictionary-based classifier

RecognitionClassifierResult

Field Description
classifier string
Name of the triggered classifier
highlights[] PhraseHighlight
List of highlights, i.e. parts of phrase that determine the result of the classification
labels[] RecognitionClassifierLabel
Classifier predictions

PhraseHighlight

Field Description
text string
Text transcription of the highlighted audio segment
start_time_ms int64
Start time of the highlighted audio segment
end_time_ms int64
End time of the highlighted audio segment

RecognitionClassifierLabel

Field Description
label string
The label of the class predicted by the classifier
confidence double
The prediction confidence

SpeakerAnalysis

Field Description
speaker_tag string
Speaker tag
window_type enum WindowType
Response window type
  • TOTAL: Stats for all received audio.
  • LAST_UTTERANCE: Stats for last utterance.
speech_boundaries AudioSegmentBoundaries
Audio segment boundaries
total_speech_ms int64
Total speech duration
speech_ratio double
Speech ratio within audio segment
total_silence_ms int64
Total silence duration
silence_ratio double
Silence ratio within audio segment
words_count int64
Number of words in recognized speech
letters_count int64
Number of letters in recognized speech
words_per_second DescriptiveStatistics
Descriptive statistics for words per second distribution
letters_per_second DescriptiveStatistics
Descriptive statistics for letters per second distribution
words_per_utterance DescriptiveStatistics
Descriptive statistics for words per utterance distribution
letters_per_utterance DescriptiveStatistics
Descriptive statistics for letters per utterance distribution
utterance_count int64
Number of utterances
utterance_duration_estimation DescriptiveStatistics
Descriptive statistics for utterance duration distribution

AudioSegmentBoundaries

Field Description
start_time_ms int64
Audio segment start time
end_time_ms int64
Audio segment end time

DescriptiveStatistics

Field Description
min double
Minimum observed value
max double
Maximum observed value
mean double
Estimated mean of distribution
std double
Estimated standard deviation of distribution
quantiles[] Quantile
List of evaluated quantiles

Quantile

Field Description
level double
Quantile level in range (0, 1)
value double
Quantile value
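
The statistics in these two messages can be reproduced locally from raw observations; a minimal sketch (using the nearest-rank quantile convention, which may differ from the server's exact method):

```python
import math

def descriptive_statistics(values: list[float], quantile_levels: list[float]) -> dict:
    """Build a DescriptiveStatistics-like dict from raw observations."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    ordered = sorted(values)
    # Nearest-rank quantiles: one Quantile-like dict per requested level.
    quantiles = [{"level": q, "value": ordered[min(int(q * n), n - 1)]}
                 for q in quantile_levels]
    return {"min": ordered[0], "max": ordered[-1],
            "mean": mean, "std": std, "quantiles": quantiles}

stats = descriptive_statistics([1.0, 2.0, 3.0, 4.0], [0.5])
```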

ConversationAnalysis

Field Description
conversation_boundaries AudioSegmentBoundaries
Audio segment boundaries
total_simultaneous_silence_duration_ms int64
Total simultaneous silence duration
total_simultaneous_silence_ratio double
Simultaneous silence ratio within audio segment
simultaneous_silence_duration_estimation DescriptiveStatistics
Descriptive statistics for simultaneous silence duration distribution
total_simultaneous_speech_duration_ms int64
Total simultaneous speech duration
total_simultaneous_speech_ratio double
Simultaneous speech ratio within audio segment
simultaneous_speech_duration_estimation DescriptiveStatistics
Descriptive statistics for simultaneous speech duration distribution
speaker_interrupts[] InterruptsEvaluation
Interrupts description for every speaker
total_speech_duration_ms int64
Total speech duration, including both simultaneous and separate speech
total_speech_ratio double
Total speech ratio within audio segment

InterruptsEvaluation

Field Description
speaker_tag string
Speaker tag
interrupts_count int64
Number of interrupts made by the speaker
interrupts_duration_ms int64
Total duration of all interrupts
interrupts[] AudioSegmentBoundaries
Boundaries for every interrupt
