

SpeechKit Recognition API v3, REST: AsyncRecognizer.GetRecognition

Written by Yandex Cloud
Updated on February 24, 2025

In this article:
  • HTTP request
  • Query parameters
  • Response
  • SessionUuid
  • AudioCursors
  • AlternativeUpdate
  • Alternative
  • Word
  • LanguageEstimation
  • EouUpdate
  • FinalRefinement
  • StatusCode
  • RecognitionClassifierUpdate
  • RecognitionClassifierResult
  • PhraseHighlight
  • RecognitionClassifierLabel
  • SpeakerAnalysis
  • AudioSegmentBoundaries
  • DescriptiveStatistics
  • Quantile
  • ConversationAnalysis
  • InterruptsEvaluation

HTTP request

GET https://stt.api.cloud.yandex.net/stt/v3/getRecognition

Query parameters

operationId: string
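
For orientation, here is a minimal sketch of calling this endpoint with Python and the requests library. The IAM token and operation ID are placeholders (the operation ID comes from a prior AsyncRecognizer.RecognizeFile call), and the line-by-line parsing allows for the body containing several response objects, one JSON object per line, matching the streaming response model described below.

import json
import os

import requests

# Placeholders: provide your own IAM token and the ID of the operation
# returned by a previous AsyncRecognizer.RecognizeFile call.
IAM_TOKEN = os.environ["IAM_TOKEN"]
OPERATION_ID = "your-operation-id"

response = requests.get(
    "https://stt.api.cloud.yandex.net/stt/v3/getRecognition",
    params={"operationId": OPERATION_ID},
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
response.raise_for_status()

# The body may hold several response objects; parse them one per line.
results = [json.loads(line) for line in response.text.splitlines() if line.strip()]
for msg in results:
    print(msg.get("audioCursors", {}).get("partialTimeMs"), list(msg.keys()))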

Response

HTTP Code: 200 - OK

{
  "sessionUuid": {
    "uuid": "string",
    "userRequestId": "string"
  },
  "audioCursors": {
    "receivedDataMs": "string",
    "resetTimeMs": "string",
    "partialTimeMs": "string",
    "finalTimeMs": "string",
    "finalIndex": "string",
    "eouTimeMs": "string"
  },
  "responseWallTimeMs": "string",
  // Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`
  "partial": {
    "alternatives": [
      {
        "words": [
          {
            "text": "string",
            "startTimeMs": "string",
            "endTimeMs": "string"
          }
        ],
        "text": "string",
        "startTimeMs": "string",
        "endTimeMs": "string",
        "confidence": "string",
        "languages": [
          {
            "languageCode": "string",
            "probability": "string"
          }
        ]
      }
    ],
    "channelTag": "string"
  },
  "final": {
    "alternatives": [
      {
        "words": [
          {
            "text": "string",
            "startTimeMs": "string",
            "endTimeMs": "string"
          }
        ],
        "text": "string",
        "startTimeMs": "string",
        "endTimeMs": "string",
        "confidence": "string",
        "languages": [
          {
            "languageCode": "string",
            "probability": "string"
          }
        ]
      }
    ],
    "channelTag": "string"
  },
  "eouUpdate": {
    "timeMs": "string"
  },
  "finalRefinement": {
    "finalIndex": "string",
    // Includes only one of the fields `normalizedText`
    "normalizedText": {
      "alternatives": [
        {
          "words": [
            {
              "text": "string",
              "startTimeMs": "string",
              "endTimeMs": "string"
            }
          ],
          "text": "string",
          "startTimeMs": "string",
          "endTimeMs": "string",
          "confidence": "string",
          "languages": [
            {
              "languageCode": "string",
              "probability": "string"
            }
          ]
        }
      ],
      "channelTag": "string"
    }
    // end of the list of possible fields
  },
  "statusCode": {
    "codeType": "string",
    "message": "string"
  },
  "classifierUpdate": {
    "windowType": "string",
    "startTimeMs": "string",
    "endTimeMs": "string",
    "classifierResult": {
      "classifier": "string",
      "highlights": [
        {
          "text": "string",
          "startTimeMs": "string",
          "endTimeMs": "string"
        }
      ],
      "labels": [
        {
          "label": "string",
          "confidence": "string"
        }
      ]
    }
  },
  "speakerAnalysis": {
    "speakerTag": "string",
    "windowType": "string",
    "speechBoundaries": {
      "startTimeMs": "string",
      "endTimeMs": "string"
    },
    "totalSpeechMs": "string",
    "speechRatio": "string",
    "totalSilenceMs": "string",
    "silenceRatio": "string",
    "wordsCount": "string",
    "lettersCount": "string",
    "wordsPerSecond": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "lettersPerSecond": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "wordsPerUtterance": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "lettersPerUtterance": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "utteranceCount": "string",
    "utteranceDurationEstimation": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    }
  },
  "conversationAnalysis": {
    "conversationBoundaries": {
      "startTimeMs": "string",
      "endTimeMs": "string"
    },
    "totalSimultaneousSilenceDurationMs": "string",
    "totalSimultaneousSilenceRatio": "string",
    "simultaneousSilenceDurationEstimation": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "totalSimultaneousSpeechDurationMs": "string",
    "totalSimultaneousSpeechRatio": "string",
    "simultaneousSpeechDurationEstimation": {
      "min": "string",
      "max": "string",
      "mean": "string",
      "std": "string",
      "quantiles": [
        {
          "level": "string",
          "value": "string"
        }
      ]
    },
    "speakerInterrupts": [
      {
        "speakerTag": "string",
        "interruptsCount": "string",
        "interruptsDurationMs": "string",
        "interrupts": [
          {
            "startTimeMs": "string",
            "endTimeMs": "string"
          }
        ]
      }
    ],
    "totalSpeechDurationMs": "string",
    "totalSpeechRatio": "string"
  },
  // end of the list of possible fields
  "channelTag": "string"
}

Responses from the server. Each response contains the session UUID, AudioCursors, and one specific event.

sessionUuid: SessionUuid
Session identifier.

audioCursors: AudioCursors
Progress of the streaming recognition session: how much data the server has received, final and partial times, and so on.

responseWallTimeMs: string (int64)
Wall-clock time on the server side: the time when the server wrote the results to the stream.

partial: AlternativeUpdate
Partial results; the server sends them regularly once enough audio data has been received from the user. This is the current text estimate for the span from final_time_ms to partial_time_ms and may change as new data arrives.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

final: AlternativeUpdate
Final results; the recognition is fixed up to final_time_ms. For now, final is sent only if the EOU event was triggered. This may change in future releases.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

eouUpdate: EouUpdate
After the EOU classifier triggers, the server sends a final message and then an EouUpdate with the EOU time. Before the eouUpdate, it sends a final with the same time; there may be several finals before one eouUpdate.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

finalRefinement: FinalRefinement
For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Final normalization introduces additional latency.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

statusCode: StatusCode
Status messages sent by the server at a fixed interval (keep-alive).
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

classifierUpdate: RecognitionClassifierUpdate
Result of the triggered classifier.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

speakerAnalysis: SpeakerAnalysis
Speech statistics for every speaker.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

conversationAnalysis: ConversationAnalysis
Conversation statistics.
Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis.

channelTag: string
Tag used to distinguish audio channels.
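
Because each response object carries exactly one of the event fields above, a client typically dispatches on whichever key is present. A minimal sketch under that assumption (handle_response is a hypothetical helper, not part of the API):

def handle_response(msg: dict) -> None:
    """Dispatch one response object on its event field (the oneOf group)."""
    cursors = msg.get("audioCursors", {})
    if "partial" in msg:
        # Unstable hypothesis for the span final_time_ms..partial_time_ms.
        alts = msg["partial"].get("alternatives", [])
        if alts:
            print("partial:", alts[0].get("text", ""))
    elif "final" in msg:
        alts = msg["final"].get("alternatives", [])
        if alts:
            print(f"final #{cursors.get('finalIndex')}:", alts[0].get("text", ""))
    elif "finalRefinement" in msg:
        ref = msg["finalRefinement"]
        alts = ref.get("normalizedText", {}).get("alternatives", [])
        if alts:
            print(f"normalized #{ref.get('finalIndex')}:", alts[0].get("text", ""))
    elif "eouUpdate" in msg:
        print("end of utterance at", msg["eouUpdate"].get("timeMs"), "ms")
    elif "statusCode" in msg:
        sc = msg["statusCode"]
        print("status:", sc.get("codeType"), sc.get("message"))
    # classifierUpdate, speakerAnalysis, and conversationAnalysis would be
    # handled the same way; omitted here for brevity.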

SessionUuid

Session identifier.

uuid: string
Internal session identifier.

userRequestId: string
User session identifier.

AudioCursors

AudioCursors represent the state of the ASR recognition stream.

receivedDataMs: string (int64)
Duration of the audio the server has received, in ms. This cursor moves after the server receives each audio chunk.

resetTimeMs: string (int64)
Input stream reset data.

partialTimeMs: string (int64)
How much audio has been processed, including trimmed silence. This cursor moves once the server has received enough data to update the recognition results (silence included).

finalTimeMs: string (int64)
Time of the last final. This cursor moves when the server decides that the recognition from the start of the audio up to final_time_ms will no longer change; this event is usually followed by EOU detection (though this may change in the future).

finalIndex: string (int64)
Index of the last final the server sent. Incremented after each new final.

eouTimeMs: string (int64)
Estimated time of EOU. This cursor is updated after each new EOU is sent. For an external classifier, it equals received_data_ms at the moment the EOU event arrives. For the internal classifier, it is a time estimate; the time is not exact and has the same guarantees as word timings.
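
One practical use of the cursors is a rough progress estimate for long recordings, for example the ratio of processed to received audio. A sketch (the int64 fields arrive as strings and must be converted; treating partialTimeMs over receivedDataMs as "progress" is an interpretation, not a documented guarantee):

def recognition_progress(cursors: dict) -> float:
    """Rough fraction of received audio that has been processed so far."""
    received = int(cursors.get("receivedDataMs", "0"))
    processed = int(cursors.get("partialTimeMs", "0"))
    return processed / received if received else 0.0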

AlternativeUpdate

Update of the recognition hypothesis.

alternatives[]: Alternative
List of hypotheses for the time frame.

channelTag: string

Alternative

Recognition result for a specific time frame.

words[]: Word
Words in the time frame.

text: string
Text in the time frame.

startTimeMs: string (int64)
Start of the time frame.

endTimeMs: string (int64)
End of the time frame.

confidence: string
The hypothesis confidence. Currently not used.

languages[]: LanguageEstimation
Distribution over possible languages.

Word

Recognized word.

text: string
Word text.

startTimeMs: string (int64)
Estimated word start time, in ms.

endTimeMs: string (int64)
Estimated word end time, in ms.
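
Putting Alternative and Word together: a sketch that pulls one alternative's text and per-word timings out of an AlternativeUpdate. Since confidence is currently unused, taking the first alternative as the best one is an assumption.

def best_alternative(update: dict) -> tuple[str, list[tuple[str, int, int]]]:
    """Return the first alternative's text plus (word, start_ms, end_ms) timings."""
    alts = update.get("alternatives", [])
    if not alts:
        return "", []
    alt = alts[0]
    words = [
        (w["text"], int(w["startTimeMs"]), int(w["endTimeMs"]))
        for w in alt.get("words", [])
    ]
    return alt.get("text", ""), words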

LanguageEstimation

Estimation of language and its probability.

languageCode: string
Language code in ISO 639-1 format.

probability: string
Estimated language probability.

EouUpdate

Update information for external End of Utterance.

timeMs: string (int64)
Estimated EOU time.

FinalRefinement

Refinement of a final hypothesis; for example, text normalization is a refinement.

finalIndex: string (int64)
Index of the final for which the server sends additional information.

normalizedText: AlternativeUpdate
Normalized text instead of the raw one.
Includes only one of the fields normalizedText; the field that is present indicates the type of refinement.
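
Since each finalRefinement references its final by finalIndex, a client can overlay normalized text onto the raw finals it has already collected. A sketch reusing best_alternative from above; reading the final's own index from audioCursors.finalIndex is an assumption, not a documented contract.

def collect_transcript(messages: list[dict]) -> str:
    """Assemble a transcript, preferring normalized refinements over raw finals."""
    finals: dict[int, str] = {}
    for msg in messages:  # parsed response objects, in arrival order
        if "final" in msg:
            idx = int(msg.get("audioCursors", {}).get("finalIndex", "0"))
            text, _ = best_alternative(msg["final"])  # sketch defined above
            finals.setdefault(idx, text)
        elif "finalRefinement" in msg:
            ref = msg["finalRefinement"]
            idx = int(ref.get("finalIndex", "0"))
            text, _ = best_alternative(ref.get("normalizedText", {}))
            finals[idx] = text  # the refinement replaces the raw final
    return " ".join(finals[i] for i in sorted(finals))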

StatusCode

Status message.

codeType: enum (CodeType)
Code type.

  • CODE_TYPE_UNSPECIFIED
  • WORKING: All good.
  • WARNING: For example, if speech is not sent in real time, or the context is unknown and a fallback was applied.
  • CLOSED: Sent after the session was closed.

message: string
Human-readable message.

RecognitionClassifierUpdate

windowType: enum (WindowType)
Response window type.

  • WINDOW_TYPE_UNSPECIFIED
  • LAST_UTTERANCE: The result of applying the classifier to the last utterance response.
  • LAST_FINAL: The result of applying the classifier to the last final response.
  • LAST_PARTIAL: The result of applying the classifier to the last partial response.

startTimeMs: string (int64)
Start time of the audio segment used for classification.

endTimeMs: string (int64)
End time of the audio segment used for classification.

classifierResult: RecognitionClassifierResult
Result of the dictionary-based classifier.

RecognitionClassifierResult

classifier: string
Name of the triggered classifier.

highlights[]: PhraseHighlight
List of highlights, i.e., parts of the phrase that determined the result of the classification.

labels[]: RecognitionClassifierLabel
Classifier predictions.

PhraseHighlight

text: string
Text transcription of the highlighted audio segment.

startTimeMs: string (int64)
Start time of the highlighted audio segment.

endTimeMs: string (int64)
End time of the highlighted audio segment.

RecognitionClassifierLabel

label: string
The class label predicted by the classifier.

confidence: string
The prediction confidence.
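
Consuming a classifierUpdate is plain dictionary traversal over these three messages. A sketch that logs the most confident label and its highlights (print_classifier_update is a hypothetical helper; confidence values are strings and are parsed as floats here on the assumption that they are numeric):

def print_classifier_update(update: dict) -> None:
    """Log the most confident label from a RecognitionClassifierUpdate."""
    result = update.get("classifierResult", {})
    labels = result.get("labels", [])
    if not labels:
        return
    top = max(labels, key=lambda item: float(item.get("confidence", "0")))
    window = update.get("windowType", "WINDOW_TYPE_UNSPECIFIED")
    print(f"{result.get('classifier')} [{window}]: "
          f"{top['label']} (confidence {top['confidence']})")
    for hl in result.get("highlights", []):
        print(f"  highlight {hl['text']!r} at {hl['startTimeMs']}..{hl['endTimeMs']} ms")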

SpeakerAnalysis

speakerTag: string
Speaker tag.

windowType: enum (WindowType)
Response window type.

  • WINDOW_TYPE_UNSPECIFIED
  • TOTAL: Stats for all received audio.
  • LAST_UTTERANCE: Stats for the last utterance.

speechBoundaries: AudioSegmentBoundaries
Audio segment boundaries.

totalSpeechMs: string (int64)
Total speech duration.

speechRatio: string
Speech ratio within the audio segment.

totalSilenceMs: string (int64)
Total silence duration.

silenceRatio: string
Silence ratio within the audio segment.

wordsCount: string (int64)
Number of words in the recognized speech.

lettersCount: string (int64)
Number of letters in the recognized speech.

wordsPerSecond: DescriptiveStatistics
Descriptive statistics for the words-per-second distribution.

lettersPerSecond: DescriptiveStatistics
Descriptive statistics for the letters-per-second distribution.

wordsPerUtterance: DescriptiveStatistics
Descriptive statistics for the words-per-utterance distribution.

lettersPerUtterance: DescriptiveStatistics
Descriptive statistics for the letters-per-utterance distribution.

utteranceCount: string (int64)
Number of utterances.

utteranceDurationEstimation: DescriptiveStatistics
Descriptive statistics for the utterance-duration distribution.

AudioSegmentBoundaries

startTimeMs: string (int64)
Audio segment start time.

endTimeMs: string (int64)
Audio segment end time.

DescriptiveStatistics

min: string
Minimum observed value.

max: string
Maximum observed value.

mean: string
Estimated mean of the distribution.

std: string
Estimated standard deviation of the distribution.

quantiles[]: Quantile
List of evaluated quantiles.

Quantile

level: string
Quantile level in the range (0, 1).

value: string
Quantile value.
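
Reading these statistics back is again plain traversal. For example, a one-line summary of a SpeakerAnalysis message (a sketch; summarize_speaker is a hypothetical helper, and the numeric fields, which arrive as strings, are printed as-is):

def summarize_speaker(analysis: dict) -> str:
    """One-line summary of a SpeakerAnalysis message."""
    wps = analysis.get("wordsPerSecond", {})
    return (
        f"speaker {analysis.get('speakerTag')}: "
        f"{analysis.get('utteranceCount')} utterances, "
        f"speech ratio {analysis.get('speechRatio')}, "
        f"~{wps.get('mean')} words/s"
    )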

ConversationAnalysis

conversationBoundaries: AudioSegmentBoundaries
Audio segment boundaries.

totalSimultaneousSilenceDurationMs: string (int64)
Total simultaneous silence duration.

totalSimultaneousSilenceRatio: string
Simultaneous silence ratio within the audio segment.

simultaneousSilenceDurationEstimation: DescriptiveStatistics
Descriptive statistics for the simultaneous-silence duration distribution.

totalSimultaneousSpeechDurationMs: string (int64)
Total simultaneous speech duration.

totalSimultaneousSpeechRatio: string
Simultaneous speech ratio within the audio segment.

simultaneousSpeechDurationEstimation: DescriptiveStatistics
Descriptive statistics for the simultaneous-speech duration distribution.

speakerInterrupts[]: InterruptsEvaluation
Interrupt statistics for every speaker.

totalSpeechDurationMs: string (int64)
Total speech duration, including both simultaneous and separate speech.

totalSpeechRatio: string
Total speech ratio within the audio segment.

InterruptsEvaluation

speakerTag: string
Speaker tag.

interruptsCount: string (int64)
Number of interrupts made by the speaker.

interruptsDurationMs: string (int64)
Total duration of all interrupts.

interrupts[]: AudioSegmentBoundaries
Boundaries of every interrupt.
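
To close the loop, a sketch that reports per-speaker interrupt counts from a ConversationAnalysis message (summarize_interrupts is a hypothetical helper; field names follow the schema above):

def summarize_interrupts(conv: dict) -> None:
    """Print per-speaker interrupt counts from a ConversationAnalysis message."""
    for speaker in conv.get("speakerInterrupts", []):
        print(
            f"speaker {speaker.get('speakerTag')}: "
            f"{speaker.get('interruptsCount')} interrupts, "
            f"{speaker.get('interruptsDurationMs')} ms total"
        )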
