SpeechKit Recognition API v3, REST: AsyncRecognizer.GetRecognition

Written by Yandex Cloud
Updated on October 3, 2025
In this article:

  • HTTP request
  • Query parameters
  • Response
  • SessionUuid
  • AudioCursors
  • AlternativeUpdate
  • Alternative
  • Word
  • LanguageEstimation
  • EouUpdate
  • FinalRefinement
  • StatusCode
  • RecognitionClassifierUpdate
  • RecognitionClassifierResult
  • PhraseHighlight
  • RecognitionClassifierLabel
  • SpeakerAnalysis
  • AudioSegmentBoundaries
  • DescriptiveStatistics
  • Quantile
  • ConversationAnalysis
  • InterruptsEvaluation
  • Summarization
  • SummarizationPropertyResult
  • ContentUsage

Retrieves the results of asynchronous recognition after the operation has finished.

HTTP request

GET https://stt.api.cloud.yandex.net/stt/v3/getRecognition

Query parameters

Field

Description

operationId

string

ID of the asynchronous recognition operation whose results to fetch.
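
A minimal sketch of calling this endpoint with Python's requests library. It assumes authentication with an IAM bearer token in the Authorization header (see the API authentication page for alternatives); both placeholder values are hypothetical.

import requests

# Placeholders: substitute a valid IAM token and the operation ID
# returned when the recognition was started (e.g., by RecognizeFile).
IAM_TOKEN = "<iam-token>"
OPERATION_ID = "<operation-id>"

resp = requests.get(
    "https://stt.api.cloud.yandex.net/stt/v3/getRecognition",
    params={"operationId": OPERATION_ID},
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)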

Response

HTTP Code: 200 - OK

{
  "sessionUuid": {
    "uuid": "string",
    "userRequestId": "string"
  },
  "audioCursors": {
    "receivedDataMs": "string",
    "resetTimeMs": "string",
    "partialTimeMs": "string",
    "finalTimeMs": "string",
    "finalIndex": "string",
    "eouTimeMs": "string"
  },
  "responseWallTimeMs": "string",
  // Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`
  "partial": {
    "alternatives": [
      {
        "words": [
          {
            "text": "string",
            "startTimeMs": "string",
            "endTimeMs": "string"
          }
        ],
        "text": "string",
        "startTimeMs": "string",
        "endTimeMs": "string",
        "confidence": "string",
        "languages": [
          {
            "languageCode": "string",
            "probability": "string"
          }
        ]
      }
    ],
    "channelTag": "string"
  },
  "final": {
    "alternatives": [
      {
        "words": [
          {
            "text": "string",
            "startTimeMs": "string",
            "endTimeMs": "string"
          }
        ],
        "text": "string",
        "startTimeMs": "string",
        "endTimeMs": "string",
        "confidence": "string",
        "languages": [
          {
            "languageCode": "string",
            "probability": "string"
          }
        ]
      }
    ],
    "channelTag": "string"
  },
  "eouUpdate": {
    "timeMs": "string"
  },
  "finalRefinement": {
    "finalIndex": "string",
    // Includes only one of the fields `normalizedText`
    "normalizedText": {
      "alternatives": [
        {
          "words": [
            {
              "text": "string",
              "startTimeMs": "string",
              "endTimeMs": "string"
            }
          ],
          "text": "string",
          "startTimeMs": "string",
          "endTimeMs": "string",
          "confidence": "string",
          "languages": [
            {
              "languageCode": "string",
              "probability": "string"
            }
          ]
        }
      ],
      "channelTag": "string"
    }
    // end of the list of possible fields
  },
  "statusCode": {
    "codeType": "string",
    "message": "string"
  },
  "classifierUpdate": {
    "windowType": "string",
    "startTimeMs": "string",
    "endTimeMs": "string",
    "classifierResult": {
      "classifier": "string",
      "highlights": [
        {
          "text": "string",
          "startTimeMs": "string",
          "endTimeMs": "string"
        }
      ],
      "labels": [
        {
          "label": "string",
          "confidence": "string"
        }
      ]
    }
  },
  "speakerAnalysis": {
    "speakerTag": "string",
    "windowType": "string",
    "speechBoundaries": {
      "startTimeMs": "string",
      "endTimeMs": "string"
    },
    "totalSpeechMs": "string",
    "speechRatio": "string",
    "totalSilenceMs": "string",
    "silenceRatio": "string",
    "wordsCount": "string",
    "lettersCount": "string",
    "wordsPerSecond": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "lettersPerSecond": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "wordsPerUtterance": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "lettersPerUtterance": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "utteranceCount": "string",
    "utteranceDurationEstimation": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    }
  },
  "conversationAnalysis": {
    "conversationBoundaries": {
      "startTimeMs": "string",
      "endTimeMs": "string"
    },
    "totalSimultaneousSilenceDurationMs": "string",
    "totalSimultaneousSilenceRatio": "string",
    "simultaneousSilenceDurationEstimation": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "totalSimultaneousSpeechDurationMs": "string",
    "totalSimultaneousSpeechRatio": "string",
    "simultaneousSpeechDurationEstimation": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "speakerInterrupts": [
      {
        "speakerTag": "string",
        "interruptsCount": "string",
        "interruptsDurationMs": "string",
        "interrupts": [
          {
            "startTimeMs": "string",
            "endTimeMs": "string"
          }
        ]
      }
    ],
    "totalSpeechDurationMs": "string",
    "totalSpeechRatio": "string"
  },
  "summarization": {
    "results": [
      {
        "response": "string"
      }
    ],
    "contentUsage": {
      "inputTextTokens": "string",
      "completionTokens": "string",
      "totalTokens": "string"
    }
  },
  // end of the list of possible fields
  "channelTag": "string"
}

Responses from the server.
Each response contains the session UUID, AudioCursors, and one specific event.
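
Since the server returns a sequence of response objects, a consumer typically dispatches on whichever oneof event field is present. A hedged sketch, assuming one JSON object per non-empty line of the response body (helper names are illustrative):

import json
from typing import Iterator

def iter_events(body: str) -> Iterator[dict]:
    # Assumption: one JSON response object per non-empty line.
    for line in body.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)

def final_texts(body: str) -> list[str]:
    # Collect the best-alternative text of every `final` event;
    # other event types are skipped.
    texts = []
    for event in iter_events(body):
        final = event.get("final")
        if final and final.get("alternatives"):
            texts.append(final["alternatives"][0]["text"])
    return texts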

Field

Description

sessionUuid

SessionUuid

Session identifier.

audioCursors

AudioCursors

Progress of the streaming recognition session: how much data the server has received, final and partial times, and so on.

responseWallTimeMs

string (int64)

Server-side wall-clock time at which the server wrote the results to the stream.

partial

AlternativeUpdate

Partial results; the server sends them regularly once enough audio data has been received from the user.
This is the current text estimate from final_time_ms to partial_time_ms and may change as new data arrives.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

final

AlternativeUpdate

Final results; the recognition is fixed up to final_time_ms. For now, final is sent only if an EOU event was triggered. This behavior may change in future releases.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

eouUpdate

EouUpdate

After the EOU classifier triggers, the server sends a final message and then an EouUpdate with the time of the EOU.
Before the eou_update, a final with the same time is sent; there may be several finals before an EOU update.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

finalRefinement

FinalRefinement

For each final, if normalization is enabled, the server sends the normalized text (or the result of other advanced post-processing).
Final normalization introduces additional latency.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

statusCode

StatusCode

Status messages, sent by the server at a fixed interval (keep-alive).

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

classifierUpdate

RecognitionClassifierUpdate

Result of the triggered classifier.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

speakerAnalysis

SpeakerAnalysis

Speech statistics for every speaker.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

conversationAnalysis

ConversationAnalysis

Conversation statistics.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

summarization

Summarization

Summary.

Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis, summarization.

channelTag

string

Tag for distinguishing audio channels.

SessionUuid

Session identifier.

Field

Description

uuid

string

Internal session identifier.

userRequestId

string

User session identifier.

AudioCursors

AudioCursors represent the state of the ASR recognition stream.

Field

Description

receivedDataMs

string (int64)

Amount of audio data the server has received. This cursor advances after each audio chunk the server receives.

resetTimeMs

string (int64)

Input stream reset data.

partialTimeMs

string (int64)

How much audio has been processed; this time includes trimmed silences.
This cursor advances once the server has received enough data to update the recognition results (silence included).

finalTimeMs

string (int64)

Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to final_time_ms will no longer change; usually this event is followed by EOU detection. This behavior may change in the future.

finalIndex

string (int64)

Index of the last final the server has sent. Incremented after each new final.

eouTimeMs

string (int64)

Estimated time of EOU. The cursor is updated after each new EOU is sent.
For the external classifier, this equals receivedDataMs at the moment the EOU event arrives.
For the internal classifier, this is an estimate; the time is not exact and has the same guarantees as word timings.
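
Note that every int64 field in the schema above is serialized as a JSON string, so cursor values must be converted before arithmetic. A small illustrative helper (the function name is hypothetical):

def cursor_ms(cursors: dict, field: str) -> int:
    # int64 values arrive as JSON strings; convert before doing math.
    return int(cursors.get(field, "0"))

# Example: audio received but not yet finalized, for an audioCursors
# dict taken from a response.
# pending_ms = cursor_ms(c, "partialTimeMs") - cursor_ms(c, "finalTimeMs")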

AlternativeUpdate

Update of the hypothesis.

Field

Description

alternatives[]

Alternative

List of hypotheses for the time frame.

channelTag

string

Tag for distinguishing audio channels.

Alternative

Recognition of a specific time frame.

Field

Description

words[]

Word

Words in time frame.

text

string

Text in time frame.

startTimeMs

string (int64)

Start of time frame.

endTimeMs

string (int64)

End of time frame.

confidence

string

The hypothesis confidence. Currently not used.

languages[]

LanguageEstimation

Distribution over possible languages.
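
A hedged sketch of reading word timings and the most probable language out of an Alternative dict shaped like the schema above (the function name is illustrative):

def describe_alternative(alt: dict) -> None:
    # Word-level timings; millisecond values are serialized as strings.
    for word in alt.get("words", []):
        print(f"{word['text']}: {int(word['startTimeMs'])}-{int(word['endTimeMs'])} ms")
    # Pick the most probable language, if language estimation is present.
    langs = alt.get("languages", [])
    if langs:
        top = max(langs, key=lambda lang: float(lang["probability"]))
        print("language:", top["languageCode"])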

Word

Recognized word.

Field

Description

text

string

Word text.

startTimeMs

string (int64)

Estimation of word start time in ms.

endTimeMs

string (int64)

Estimation of word end time in ms.

LanguageEstimation

Estimation of language and its probability.

Field

Description

languageCode

string

Language tag in IETF BCP 47 format, consisting of ISO 639-1 language code and ISO 3166-1 country code (e.g., en-US, ru-RU).

probability

string

Estimation of language probability.

EouUpdate

Update information for external End of Utterance.

Field

Description

timeMs

string (int64)

EOU estimated time.

FinalRefinement

Refinement of a final hypothesis. For example, text normalization is a refinement.

Field

Description

finalIndex

string (int64)

Index of final for which server sends additional information.

normalizedText

AlternativeUpdate

Normalized text instead of the raw one.

Includes only one of the fields normalizedText.

Type of refinement.
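
A consumer that keeps finals indexed by finalIndex can swap in the normalized text when the refinement arrives. A sketch under the assumed shapes above (the helper is hypothetical):

def apply_refinement(finals: dict, refinement: dict) -> None:
    # Replace the raw final text at finalIndex with its normalized form.
    normalized = refinement.get("normalizedText")
    if normalized and normalized.get("alternatives"):
        idx = int(refinement["finalIndex"])
        finals[idx] = normalized["alternatives"][0]["text"]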

StatusCode

Status message.

Field

Description

codeType

enum (CodeType)

Code type.

  • CODE_TYPE_UNSPECIFIED
  • WORKING: All good.
  • WARNING: For example, if speech is not sent in real time, or the context is unknown and a fallback was made.
  • CLOSED: Sent after the session was closed.

message

string

Human-readable message.
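
A minimal dispatch on the code type; the logging choices here are illustrative, not prescribed by the API:

def handle_status(status: dict) -> None:
    code = status.get("codeType")
    if code == "WORKING":
        pass  # keep-alive; nothing to do
    elif code == "WARNING":
        print("warning:", status.get("message", ""))
    elif code == "CLOSED":
        print("session closed:", status.get("message", ""))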

RecognitionClassifierUpdate

Field

Description

windowType

enum (WindowType)

Response window type.

  • WINDOW_TYPE_UNSPECIFIED
  • LAST_UTTERANCE: The result of applying the classifier to the last utterance response.
  • LAST_FINAL: The result of applying the classifier to the last final response.
  • LAST_PARTIAL: The result of applying the classifier to the last partial response.

startTimeMs

string (int64)

Start time of the audio segment used for classification.

endTimeMs

string (int64)

End time of the audio segment used for classification.

classifierResult

RecognitionClassifierResult

Result for dictionary-based classifier.

RecognitionClassifierResult

Field

Description

classifier

string

Name of the triggered classifier.

highlights[]

PhraseHighlight

List of highlights, i.e., the parts of the phrase that determine the result of the classification.

labels[]

RecognitionClassifierLabel

Classifier predictions.

PhraseHighlight

Field

Description

text

string

Text transcription of the highlighted audio segment.

startTimeMs

string (int64)

Start time of the highlighted audio segment.

endTimeMs

string (int64)

End time of the highlighted audio segment.

RecognitionClassifierLabel

Field

Description

label

string

The label of the class predicted by the classifier.

confidence

string

The prediction confidence.
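
For example, to keep only predictions above a confidence threshold (the 0.8 cutoff is illustrative, not an API default):

def confident_labels(result: dict, threshold: float = 0.8) -> list[str]:
    # confidence is serialized as a string; convert before comparing.
    return [
        label["label"]
        for label in result.get("labels", [])
        if float(label["confidence"]) >= threshold
    ]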

SpeakerAnalysis

Field

Description

speakerTag

string

Speaker tag.

windowType

enum (WindowType)

Response window type.

  • WINDOW_TYPE_UNSPECIFIED
  • TOTAL: Stats for all received audio.
  • LAST_UTTERANCE: Stats for last utterance.

speechBoundaries

AudioSegmentBoundaries

Audio segment boundaries.

totalSpeechMs

string (int64)

Total speech duration.

speechRatio

string

Speech ratio within audio segment.

totalSilenceMs

string (int64)

Total duration of silence.

silenceRatio

string

Silence ratio within audio segment.

wordsCount

string (int64)

Number of words in recognized speech.

lettersCount

string (int64)

Number of letters in recognized speech.

wordsPerSecond

DescriptiveStatistics

Descriptive statistics for words per second distribution.

lettersPerSecond

DescriptiveStatistics

Descriptive statistics for letters per second distribution.

wordsPerUtterance

DescriptiveStatistics

Descriptive statistics for words per utterance distribution.

lettersPerUtterance

DescriptiveStatistics

Descriptive statistics for letters per utterance distribution.

utteranceCount

string (int64)

Number of utterances.

utteranceDurationEstimation

DescriptiveStatistics

Descriptive statistics for utterance duration distribution.

AudioSegmentBoundaries

Field

Description

startTimeMs

string (int64)

Audio segment start time.

endTimeMs

string (int64)

Audio segment end time.

DescriptiveStatistics

Field

Description

min

string

Minimum observed value.

max

string

Maximum observed value.

mean

string

Estimated mean of distribution.

std

string

Estimated standard deviation of distribution.

quantiles[]

Quantile

List of evaluated quantiles.

Quantile

Field

Description

level

string

Quantile level in range (0, 1).

value

string

Quantile value.
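
A hedged helper for pulling a specific quantile (for instance the median, level 0.5) out of a DescriptiveStatistics dict; the function name is illustrative:

def quantile_value(stats: dict, level: float) -> float | None:
    # Quantile levels and values are serialized as strings.
    for q in stats.get("quantiles", []):
        if abs(float(q["level"]) - level) < 1e-9:
            return float(q["value"])
    return None

# Example: median words per second from a speakerAnalysis event.
# median_wps = quantile_value(analysis["wordsPerSecond"], 0.5)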

ConversationAnalysis

Field

Description

conversationBoundaries

AudioSegmentBoundaries

Audio segment boundaries.

totalSimultaneousSilenceDurationMs

string (int64)

Total simultaneous silence duration.

totalSimultaneousSilenceRatio

string

Simultaneous silence ratio within audio segment.

simultaneousSilenceDurationEstimation

DescriptiveStatistics

Descriptive statistics for simultaneous silence duration distribution.

totalSimultaneousSpeechDurationMs

string (int64)

Total simultaneous speech duration.

totalSimultaneousSpeechRatio

string

Simultaneous speech ratio within audio segment.

simultaneousSpeechDurationEstimation

DescriptiveStatistics

Descriptive statistics for simultaneous speech duration distribution.

speakerInterrupts[]

InterruptsEvaluation

Interrupts description for every speaker.

totalSpeechDurationMs

string (int64)

Total speech duration, including both simultaneous and separate speech.

totalSpeechRatio

string

Total speech ratio within audio segment.

InterruptsEvaluation

Field

Description

speakerTag

string

Speaker tag.

interruptsCount

string (int64)

Number of interrupts made by the speaker.

interruptsDurationMs

string (int64)

Total duration of all interrupts.

interrupts[]

AudioSegmentBoundaries

Boundaries for every interrupt.
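
A short sketch that prints per-speaker interrupt statistics from a conversationAnalysis event, under the documented shape:

def interrupt_summary(conversation: dict) -> None:
    # Counts and durations are int64 values serialized as strings.
    for entry in conversation.get("speakerInterrupts", []):
        count = int(entry["interruptsCount"])
        total_ms = int(entry["interruptsDurationMs"])
        print(f"speaker {entry['speakerTag']}: {count} interrupts, {total_ms} ms total")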

Summarization

Field

Description

results[]

SummarizationPropertyResult

A list of summarizations of the transcription.

contentUsage

ContentUsage

A set of statistics describing the number of content tokens used by the completion model.

SummarizationPropertyResult

Represents summarization response entry for transcription.

Field

Description

response

string

Summarization response text.

ContentUsage

An object representing the number of content tokens used by the completion model.

Field

Description

inputTextTokens

string (int64)

The number of tokens in the textual part of the model input.

completionTokens

string (int64)

The number of tokens in the generated completion.

totalTokens

string (int64)

The total number of tokens, including all input tokens and all generated tokens.
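
Reading the summaries and the token accounting out of a summarization event, under the documented shape:

def print_summarization(summarization: dict) -> None:
    for result in summarization.get("results", []):
        print(result["response"])
    usage = summarization.get("contentUsage")
    if usage:
        # Token counts are int64 values serialized as strings.
        print("total tokens used:", int(usage["totalTokens"]))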
