SpeechKit Recognition API v3, REST: AsyncRecognizer.GetRecognition
- HTTP request
- Query parameters
- Response
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
- Summarization
- SummarizationPropertyResult
- ContentUsage
Gets the results of asynchronous recognition after the operation has finished.
HTTP request
GET https://stt.api.ml.yandexcloud.kz/stt/v3/getRecognition
Query parameters
| Field | Description |
| --- | --- |
| operationId | string ID of the operation. |
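A minimal sketch of constructing this request with the Python standard library. The `Bearer` IAM-token `Authorization` header is an assumption; check your cloud's authentication documentation for the exact scheme.

```python
from urllib.parse import urlencode
import urllib.request

ENDPOINT = "https://stt.api.ml.yandexcloud.kz/stt/v3/getRecognition"

def build_get_recognition_request(operation_id: str, iam_token: str) -> urllib.request.Request:
    """Build (but do not send) the GetRecognition GET request.

    operationId is passed as a query parameter; the Authorization header
    format shown here is an assumption, not part of this reference.
    """
    url = f"{ENDPOINT}?{urlencode({'operationId': operation_id})}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {iam_token}"})

req = build_get_recognition_request("abc123", "<IAM_TOKEN>")
# req.full_url now carries the operationId query parameter
```

Sending the prepared request with `urllib.request.urlopen(req)` (or any HTTP client) returns the response body described below.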
Response
HTTP Code: 200 - OK
{
"sessionUuid": {
"uuid": "string",
"userRequestId": "string"
},
"audioCursors": {
"receivedDataMs": "string",
"resetTimeMs": "string",
"partialTimeMs": "string",
"finalTimeMs": "string",
"finalIndex": "string",
"eouTimeMs": "string"
},
"responseWallTimeMs": "string",
// Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "string",
"languages": [
{
"languageCode": "string",
"probability": "string"
}
]
}
],
"channelTag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "string",
"languages": [
{
"languageCode": "string",
"probability": "string"
}
]
}
],
"channelTag": "string"
},
"eouUpdate": {
"timeMs": "string"
},
"finalRefinement": {
"finalIndex": "string",
// Includes only one of the fields `normalizedText`
"normalizedText": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "string",
"languages": [
{
"languageCode": "string",
"probability": "string"
}
]
}
],
"channelTag": "string"
}
// end of the list of possible fields
},
"statusCode": {
"codeType": "string",
"message": "string"
},
"classifierUpdate": {
"windowType": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"classifierResult": {
"classifier": "string",
"highlights": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"labels": [
{
"label": "string",
"confidence": "string"
}
]
}
},
"speakerAnalysis": {
"speakerTag": "string",
"windowType": "string",
"speechBoundaries": {
"startTimeMs": "string",
"endTimeMs": "string"
},
"totalSpeechMs": "string",
"speechRatio": "string",
"totalSilenceMs": "string",
"silenceRatio": "string",
"wordsCount": "string",
"lettersCount": "string",
"wordsPerSecond": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"lettersPerSecond": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"wordsPerUtterance": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"lettersPerUtterance": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"utteranceCount": "string",
"utteranceDurationEstimation": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
}
},
"conversationAnalysis": {
"conversationBoundaries": {
"startTimeMs": "string",
"endTimeMs": "string"
},
"totalSimultaneousSilenceDurationMs": "string",
"totalSimultaneousSilenceRatio": "string",
"simultaneousSilenceDurationEstimation": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"totalSimultaneousSpeechDurationMs": "string",
"totalSimultaneousSpeechRatio": "string",
"simultaneousSpeechDurationEstimation": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"speakerInterrupts": [
{
"speakerTag": "string",
"interruptsCount": "string",
"interruptsDurationMs": "string",
"interrupts": [
{
"startTimeMs": "string",
"endTimeMs": "string"
}
]
}
],
"totalSpeechDurationMs": "string",
"totalSpeechRatio": "string"
},
"summarization": {
"results": [
{
"response": "string"
}
],
"contentUsage": {
"inputTextTokens": "string",
"completionTokens": "string",
"totalTokens": "string"
}
},
// end of the list of possible fields
"channelTag": "string"
}
Responses from the server.
Each response contains the session UUID, audio cursors, and one specific event.

| Field | Description |
| --- | --- |
| sessionUuid | Session identifier. |
| audioCursors | Progress of the recognition stream: how much data has been received, partial and final times, and so on. |
| responseWallTimeMs | string (int64) Wall-clock time on the server side: the time when the server wrote the results to the stream. |
| partial | Partial results; the server sends them regularly once enough audio data has been received from the user. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| final | Final results; the recognition up to `finalTimeMs` is now fixed. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| eouUpdate | After the EOU classifier triggers and the message with the final is sent, the server sends an `EouUpdate` with the time of the EOU. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| finalRefinement | For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| statusCode | Status messages, sent by the server at a fixed interval (keep-alive). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| classifierUpdate | Result of the triggered classifier. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| speakerAnalysis | Speech statistics for every speaker. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| conversationAnalysis | Conversation statistics. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| summarization | Summary. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| channelTag | string Tag to distinguish audio channels. |
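Because of the oneof structure, each response object carries exactly one event. A sketch in plain Python (using the field names from the schema above) that collects the recognized text of every `final` event, keyed by channel; keeping only the top alternative per final is a simplifying assumption:

```python
def collect_finals(responses):
    """Return {channelTag: [text, ...]} for every `final` event.

    `responses` is an iterable of response dicts shaped like the schema
    above; only the first alternative of each final is kept.
    """
    finals = {}
    for resp in responses:
        final = resp.get("final")
        if not final:
            continue  # some other event of the oneof (partial, statusCode, ...)
        alternatives = final.get("alternatives") or []
        if alternatives:
            channel = final.get("channelTag", "0")
            finals.setdefault(channel, []).append(alternatives[0]["text"])
    return finals

events = [
    {"statusCode": {"message": "keep-alive"}},
    {"final": {"alternatives": [{"text": "hello world"}], "channelTag": "0"}},
]
print(collect_finals(events))  # → {'0': ['hello world']}
```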
SessionUuid
Session identifier.
| Field | Description |
| --- | --- |
| uuid | string Internal session identifier. |
| userRequestId | string User session identifier. |
AudioCursors
AudioCursors represent the state of the ASR recognition stream.

| Field | Description |
| --- | --- |
| receivedDataMs | string (int64) Amount of audio data received by the server, in milliseconds. This cursor advances after each audio chunk the server receives. |
| resetTimeMs | string (int64) Input stream reset data. |
| partialTimeMs | string (int64) How much audio has been processed. This time includes trimmed silences as well. |
| finalTimeMs | string (int64) Time of the last final. This cursor is moved when the server decides that the recognition from the start of the audio up to this time is fixed. |
| finalIndex | string (int64) Index of the last final the server sent. Incremented after each new final. |
| eouTimeMs | string (int64) Estimated time of the EOU. The cursor is updated after each new EOU is sent. |
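All int64 fields arrive JSON-encoded as strings, so they must be converted before any arithmetic. A small sketch computing how much received audio is not yet covered by a final:

```python
def unfinalized_ms(audio_cursors: dict) -> int:
    """Milliseconds of received audio not yet fixed by a final.

    int64 values are serialized as strings in the JSON, hence the int() calls.
    """
    received = int(audio_cursors.get("receivedDataMs", "0"))
    finalized = int(audio_cursors.get("finalTimeMs", "0"))
    return max(received - finalized, 0)

print(unfinalized_ms({"receivedDataMs": "15000", "finalTimeMs": "12000"}))  # → 3000
```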
AlternativeUpdate
Update of the hypothesis.

| Field | Description |
| --- | --- |
| alternatives[] | List of hypotheses for the time frame. |
| channelTag | string |
Alternative
Recognition of a specific time frame.

| Field | Description |
| --- | --- |
| words[] | Words in the time frame. |
| text | string Text in the time frame. |
| startTimeMs | string (int64) Start of the time frame. |
| endTimeMs | string (int64) End of the time frame. |
| confidence | string The hypothesis confidence. Currently not used. |
| languages[] | Distribution over possible languages. |
Word
Recognized word.
| Field | Description |
| --- | --- |
| text | string Word text. |
| startTimeMs | string (int64) Estimation of word start time in ms. |
| endTimeMs | string (int64) Estimation of word end time in ms. |
LanguageEstimation
Estimation of language and its probability.
| Field | Description |
| --- | --- |
| languageCode | string Language tag in IETF BCP 47 format, consisting of ISO 639-1 language code and ISO 3166-1 country code (e.g., en-US, ru-RU). |
| probability | string Estimation of language probability. |
EouUpdate
Update information for external End of Utterance.
| Field | Description |
| --- | --- |
| timeMs | string (int64) EOU estimated time. |
FinalRefinement
Refinement for the final hypothesis. For example, text normalization is a refinement.

| Field | Description |
| --- | --- |
| finalIndex | string (int64) Index of the final for which the server sends additional information. |
| normalizedText | Normalized text instead of the raw one. Includes only one of the fields `normalizedText`. Type of refinement. |
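A `FinalRefinement` references its final by `finalIndex`, so raw and normalized texts can be paired by that index. A sketch, assuming one alternative per update for brevity:

```python
def pair_raw_and_normalized(responses):
    """Map finalIndex -> (raw text, normalized text) from a response stream.

    Uses audioCursors.finalIndex to index each `final`, and
    finalRefinement.finalIndex to attach its normalizedText; keeping only
    the first alternative of each update is a simplifying assumption.
    """
    raw, normalized = {}, {}
    for resp in responses:
        if "final" in resp:
            idx = int(resp["audioCursors"]["finalIndex"])
            raw[idx] = resp["final"]["alternatives"][0]["text"]
        elif "finalRefinement" in resp:
            refinement = resp["finalRefinement"]
            idx = int(refinement["finalIndex"])
            normalized[idx] = refinement["normalizedText"]["alternatives"][0]["text"]
    return {i: (raw.get(i), normalized.get(i)) for i in raw}

stream = [
    {"audioCursors": {"finalIndex": "0"},
     "final": {"alternatives": [{"text": "twenty five"}]}},
    {"finalRefinement": {"finalIndex": "0",
                         "normalizedText": {"alternatives": [{"text": "25"}]}}},
]
print(pair_raw_and_normalized(stream))  # → {0: ('twenty five', '25')}
```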
StatusCode
Status message.
| Field | Description |
| --- | --- |
| codeType | enum (CodeType) Code type. |
| message | string Human readable message. |
RecognitionClassifierUpdate
| Field | Description |
| --- | --- |
| windowType | enum (WindowType) Response window type. |
| startTimeMs | string (int64) Start time of the audio segment used for classification. |
| endTimeMs | string (int64) End time of the audio segment used for classification. |
| classifierResult | Result for dictionary-based classifier. |
RecognitionClassifierResult
| Field | Description |
| --- | --- |
| classifier | string Name of the triggered classifier. |
| highlights[] | List of highlights, i.e. parts of phrase that determine the result of the classification. |
| labels[] | Classifier predictions. |
PhraseHighlight
| Field | Description |
| --- | --- |
| text | string Text transcription of the highlighted audio segment. |
| startTimeMs | string (int64) Start time of the highlighted audio segment. |
| endTimeMs | string (int64) End time of the highlighted audio segment. |
RecognitionClassifierLabel
| Field | Description |
| --- | --- |
| label | string The label of the class predicted by the classifier. |
| confidence | string The prediction confidence. |
SpeakerAnalysis
| Field | Description |
| --- | --- |
| speakerTag | string Speaker tag. |
| windowType | enum (WindowType) Response window type. |
| speechBoundaries | Audio segment boundaries. |
| totalSpeechMs | string (int64) Total speech duration. |
| speechRatio | string Speech ratio within the audio segment. |
| totalSilenceMs | string (int64) Total duration of silence. |
| silenceRatio | string Silence ratio within the audio segment. |
| wordsCount | string (int64) Number of words in recognized speech. |
| lettersCount | string (int64) Number of letters in recognized speech. |
| wordsPerSecond | Descriptive statistics for the words-per-second distribution. |
| lettersPerSecond | Descriptive statistics for the letters-per-second distribution. |
| wordsPerUtterance | Descriptive statistics for the words-per-utterance distribution. |
| lettersPerUtterance | Descriptive statistics for the letters-per-utterance distribution. |
| utteranceCount | string (int64) Number of utterances. |
| utteranceDurationEstimation | Descriptive statistics for the utterance duration distribution. |
AudioSegmentBoundaries
| Field | Description |
| --- | --- |
| startTimeMs | string (int64) Audio segment start time. |
| endTimeMs | string (int64) Audio segment end time. |
DescriptiveStatistics
| Field | Description |
| --- | --- |
| min | string Minimum observed value. |
| max | string Maximum observed value. |
| mean | string Estimated mean of distribution. |
| std | string Estimated standard deviation of distribution. |
| quantiles[] | List of evaluated quantiles. |
Quantile
| Field | Description |
| --- | --- |
| level | string Quantile level in range (0, 1). |
| value | string Quantile value. |
ConversationAnalysis
| Field | Description |
| --- | --- |
| conversationBoundaries | Audio segment boundaries. |
| totalSimultaneousSilenceDurationMs | string (int64) Total simultaneous silence duration. |
| totalSimultaneousSilenceRatio | string Simultaneous silence ratio within audio segment. |
| simultaneousSilenceDurationEstimation | Descriptive statistics for simultaneous silence duration distribution. |
| totalSimultaneousSpeechDurationMs | string (int64) Total simultaneous speech duration. |
| totalSimultaneousSpeechRatio | string Simultaneous speech ratio within audio segment. |
| simultaneousSpeechDurationEstimation | Descriptive statistics for simultaneous speech duration distribution. |
| speakerInterrupts[] | Interrupts description for every speaker. |
| totalSpeechDurationMs | string (int64) Total speech duration, including both simultaneous and separate speech. |
| totalSpeechRatio | string Total speech ratio within audio segment. |
InterruptsEvaluation
| Field | Description |
| --- | --- |
| speakerTag | string Speaker tag. |
| interruptsCount | string (int64) Number of interrupts made by the speaker. |
| interruptsDurationMs | string (int64) Total duration of all interrupts. |
| interrupts[] | Boundaries for every interrupt. |
Summarization
| Field | Description |
| --- | --- |
| results[] | A list of summarizations of the transcription. |
| contentUsage | A set of statistics describing the number of content tokens used by the completion model. |
SummarizationPropertyResult
Represents summarization response entry for transcription.
| Field | Description |
| --- | --- |
| response | string Summarization response text. |
ContentUsage
An object representing the number of content tokens used by the completion model.
| Field | Description |
| --- | --- |
| inputTextTokens | string (int64) The number of tokens in the textual part of the model input. |
| completionTokens | string (int64) The number of tokens in the generated completion. |
| totalTokens | string (int64) The total number of tokens, including all input tokens and all generated tokens. |
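Since `totalTokens` covers all input and generated tokens, it should equal the sum of the other two fields. A sketch of that accounting check (the sample numbers are illustrative):

```python
def token_totals(content_usage: dict) -> int:
    """Sum of input and completion tokens; expected to equal totalTokens.

    int64 fields are JSON strings, hence the int() conversions.
    """
    return int(content_usage["inputTextTokens"]) + int(content_usage["completionTokens"])

usage = {"inputTextTokens": "800", "completionTokens": "120", "totalTokens": "920"}
print(token_totals(usage))  # → 920
```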