SpeechKit Recognition API v3, REST: AsyncRecognizer.GetRecognition
- HTTP request
- Query parameters
- Response
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
- Summarization
- SummarizationPropertyResult
- ContentUsage
Gets the results of asynchronous recognition after the operation has finished.
HTTP request
GET https://stt.api.ml.yandexcloud.kz/stt/v3/getRecognition
Query parameters
| Field | Description |
| --- | --- |
| operationId | string ID of the operation. |
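A minimal sketch of constructing this request with the Python standard library. The `Bearer` IAM-token `Authorization` header is an assumption; check your cloud's authentication documentation for the exact scheme.

```python
from urllib.parse import urlencode
import urllib.request

ENDPOINT = "https://stt.api.ml.yandexcloud.kz/stt/v3/getRecognition"

def build_get_recognition_request(operation_id: str, iam_token: str) -> urllib.request.Request:
    """Build (but do not send) the GetRecognition GET request.

    operationId is passed as a query parameter; the Authorization header
    format shown here is an assumption, not part of this reference.
    """
    url = f"{ENDPOINT}?{urlencode({'operationId': operation_id})}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {iam_token}"})

req = build_get_recognition_request("abc123", "<IAM_TOKEN>")
# req.full_url now carries the operationId query parameter
```

Sending the prepared request with `urllib.request.urlopen(req)` (or any HTTP client) returns the response body described below.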
Response
HTTP Code: 200 - OK
{
"sessionUuid": {
"uuid": "string",
"userRequestId": "string"
},
"audioCursors": {
"receivedDataMs": "string",
"resetTimeMs": "string",
"partialTimeMs": "string",
"finalTimeMs": "string",
"finalIndex": "string",
"eouTimeMs": "string"
},
"responseWallTimeMs": "string",
// Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "string",
"languages": [
{
"languageCode": "string",
"probability": "string"
}
]
}
],
"channelTag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "string",
"languages": [
{
"languageCode": "string",
"probability": "string"
}
]
}
],
"channelTag": "string"
},
"eouUpdate": {
"timeMs": "string"
},
"finalRefinement": {
"finalIndex": "string",
// Includes only one of the fields `normalizedText`
"normalizedText": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "string",
"languages": [
{
"languageCode": "string",
"probability": "string"
}
]
}
],
"channelTag": "string"
}
// end of the list of possible fields
},
"statusCode": {
"codeType": "string",
"message": "string"
},
"classifierUpdate": {
"windowType": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"classifierResult": {
"classifier": "string",
"highlights": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"labels": [
{
"label": "string",
"confidence": "string"
}
]
}
},
"speakerAnalysis": {
"speakerTag": "string",
"windowType": "string",
"speechBoundaries": {
"startTimeMs": "string",
"endTimeMs": "string"
},
"totalSpeechMs": "string",
"speechRatio": "string",
"totalSilenceMs": "string",
"silenceRatio": "string",
"wordsCount": "string",
"lettersCount": "string",
"wordsPerSecond": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"lettersPerSecond": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"wordsPerUtterance": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"lettersPerUtterance": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"utteranceCount": "string",
"utteranceDurationEstimation": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
}
},
"conversationAnalysis": {
"conversationBoundaries": {
"startTimeMs": "string",
"endTimeMs": "string"
},
"totalSimultaneousSilenceDurationMs": "string",
"totalSimultaneousSilenceRatio": "string",
"simultaneousSilenceDurationEstimation": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"totalSimultaneousSpeechDurationMs": "string",
"totalSimultaneousSpeechRatio": "string",
"simultaneousSpeechDurationEstimation": {
"min": "string",
"max": "string",
"mean": "string",
"std": "string",
"quantiles": [
{
"level": "string",
"value": "string"
}
]
},
"speakerInterrupts": [
{
"speakerTag": "string",
"interruptsCount": "string",
"interruptsDurationMs": "string",
"interrupts": [
{
"startTimeMs": "string",
"endTimeMs": "string"
}
]
}
],
"totalSpeechDurationMs": "string",
"totalSpeechRatio": "string"
},
"summarization": {
"results": [
{
"response": "string"
}
],
"contentUsage": {
"inputTextTokens": "string",
"completionTokens": "string",
"totalTokens": "string"
}
},
// end of the list of possible fields
"channelTag": "string"
}
Responses from the server.
Each response contains the session UUID, audio cursors, and one specific event.

| Field | Description |
| --- | --- |
| sessionUuid | Session identifier. |
| audioCursors | Progress of the recognition stream: how much data has been received, partial and final times, and so on. |
| responseWallTimeMs | string (int64) Wall-clock time on the server side: the time when the server wrote the results to the stream. |
| partial | Partial results; the server sends them regularly once enough audio data has been received from the user. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| final | Final results; the recognition up to `finalTimeMs` is now fixed. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| eouUpdate | After the EOU classifier triggers and the message with the final is sent, the server sends an `EouUpdate` with the time of the EOU. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| finalRefinement | For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| statusCode | Status messages, sent by the server at a fixed interval (keep-alive). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| classifierUpdate | Result of the triggered classifier. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| speakerAnalysis | Speech statistics for every speaker. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| conversationAnalysis | Conversation statistics. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| summarization | Summary. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`, `summarization`. |
| channelTag | string Tag to distinguish audio channels. |
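Because of the oneof structure, each response object carries exactly one event. A sketch in plain Python (using the field names from the schema above) that collects the recognized text of every `final` event, keyed by channel; keeping only the top alternative per final is a simplifying assumption:

```python
def collect_finals(responses):
    """Return {channelTag: [text, ...]} for every `final` event.

    `responses` is an iterable of response dicts shaped like the schema
    above; only the first alternative of each final is kept.
    """
    finals = {}
    for resp in responses:
        final = resp.get("final")
        if not final:
            continue  # some other event of the oneof (partial, statusCode, ...)
        alternatives = final.get("alternatives") or []
        if alternatives:
            channel = final.get("channelTag", "0")
            finals.setdefault(channel, []).append(alternatives[0]["text"])
    return finals

events = [
    {"statusCode": {"message": "keep-alive"}},
    {"final": {"alternatives": [{"text": "hello world"}], "channelTag": "0"}},
]
print(collect_finals(events))  # → {'0': ['hello world']}
```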
SessionUuid
Session identifier.
| Field | Description |
| --- | --- |
| uuid | string Internal session identifier. |
| userRequestId | string User session identifier. |
AudioCursors
AudioCursors represent the state of the ASR recognition stream.

| Field | Description |
| --- | --- |
| receivedDataMs | string (int64) Amount of audio data received by the server, in milliseconds. This cursor advances after each audio chunk the server receives. |
| resetTimeMs | string (int64) Input stream reset data. |
| partialTimeMs | string (int64) How much audio has been processed. This time includes trimmed silences as well. |
| finalTimeMs | string (int64) Time of the last final. This cursor is moved when the server decides that the recognition from the start of the audio up to this time is fixed. |
| finalIndex | string (int64) Index of the last final the server sent. Incremented after each new final. |
| eouTimeMs | string (int64) Estimated time of the EOU. The cursor is updated after each new EOU is sent. |
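All int64 fields arrive JSON-encoded as strings, so they must be converted before any arithmetic. A small sketch computing how much received audio is not yet covered by a final:

```python
def unfinalized_ms(audio_cursors: dict) -> int:
    """Milliseconds of received audio not yet fixed by a final.

    int64 values are serialized as strings in the JSON, hence the int() calls.
    """
    received = int(audio_cursors.get("receivedDataMs", "0"))
    finalized = int(audio_cursors.get("finalTimeMs", "0"))
    return max(received - finalized, 0)

print(unfinalized_ms({"receivedDataMs": "15000", "finalTimeMs": "12000"}))  # → 3000
```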
AlternativeUpdate
Update of the hypothesis.

| Field | Description |
| --- | --- |
| alternatives[] | List of hypotheses for the time frame. |
| channelTag | string |
Alternative
Recognition of a specific time frame.

| Field | Description |
| --- | --- |
| words[] | Words in the time frame. |
| text | string Text in the time frame. |
| startTimeMs | string (int64) Start of the time frame. |
| endTimeMs | string (int64) End of the time frame. |
| confidence | string The hypothesis confidence. Currently not used. |
| languages[] | Distribution over possible languages. |
Word
Recognized word.
| Field | Description |
| --- | --- |
| text | string Word text. |
| startTimeMs | string (int64) Estimation of word start time in ms. |
| endTimeMs | string (int64) Estimation of word end time in ms. |
LanguageEstimation
Estimation of language and its probability.
| Field | Description |
| --- | --- |
| languageCode | string Language tag in IETF BCP 47 format, consisting of ISO 639-1 language code and ISO 3166-1 country code (e.g., en-US, ru-RU). |
| probability | string Estimation of language probability. |
EouUpdate
Update information for external End of Utterance.
| Field | Description |
| --- | --- |
| timeMs | string (int64) EOU estimated time. |
FinalRefinement
Refinement for the final hypothesis. For example, text normalization is a refinement.

| Field | Description |
| --- | --- |
| finalIndex | string (int64) Index of the final for which the server sends additional information. |
| normalizedText | Normalized text instead of the raw one. Includes only one of the fields `normalizedText`. Type of refinement. |
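A `FinalRefinement` references its final by `finalIndex`, so raw and normalized texts can be paired by that index. A sketch, assuming one alternative per update for brevity:

```python
def pair_raw_and_normalized(responses):
    """Map finalIndex -> (raw text, normalized text) from a response stream.

    Uses audioCursors.finalIndex to index each `final`, and
    finalRefinement.finalIndex to attach its normalizedText; keeping only
    the first alternative of each update is a simplifying assumption.
    """
    raw, normalized = {}, {}
    for resp in responses:
        if "final" in resp:
            idx = int(resp["audioCursors"]["finalIndex"])
            raw[idx] = resp["final"]["alternatives"][0]["text"]
        elif "finalRefinement" in resp:
            refinement = resp["finalRefinement"]
            idx = int(refinement["finalIndex"])
            normalized[idx] = refinement["normalizedText"]["alternatives"][0]["text"]
    return {i: (raw.get(i), normalized.get(i)) for i in raw}

stream = [
    {"audioCursors": {"finalIndex": "0"},
     "final": {"alternatives": [{"text": "twenty five"}]}},
    {"finalRefinement": {"finalIndex": "0",
                         "normalizedText": {"alternatives": [{"text": "25"}]}}},
]
print(pair_raw_and_normalized(stream))  # → {0: ('twenty five', '25')}
```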
StatusCode
Status message.
| Field | Description |
| --- | --- |
| codeType | enum (CodeType) Code type. |
| message | string Human readable message. |
RecognitionClassifierUpdate
| Field | Description |
| --- | --- |
| windowType | enum (WindowType) Response window type. |
| startTimeMs | string (int64) Start time of the audio segment used for classification. |
| endTimeMs | string (int64) End time of the audio segment used for classification. |
| classifierResult | Result for dictionary-based classifier. |
RecognitionClassifierResult
| Field | Description |
| --- | --- |
| classifier | string Name of the triggered classifier. |
| highlights[] | List of highlights, i.e. parts of phrase that determine the result of the classification. |
| labels[] | Classifier predictions. |
PhraseHighlight
| Field | Description |
| --- | --- |
| text | string Text transcription of the highlighted audio segment. |
| startTimeMs | string (int64) Start time of the highlighted audio segment. |
| endTimeMs | string (int64) End time of the highlighted audio segment. |
RecognitionClassifierLabel
| Field | Description |
| --- | --- |
| label | string The label of the class predicted by the classifier. |
| confidence | string The prediction confidence. |
SpeakerAnalysis
| Field | Description |
| --- | --- |
| speakerTag | string Speaker tag. |
| windowType | enum (WindowType) Response window type. |
| speechBoundaries | Audio segment boundaries. |
| totalSpeechMs | string (int64) Total speech duration. |
| speechRatio | string Speech ratio within the audio segment. |
| totalSilenceMs | string (int64) Total duration of silence. |
| silenceRatio | string Silence ratio within the audio segment. |
| wordsCount | string (int64) Number of words in recognized speech. |
| lettersCount | string (int64) Number of letters in recognized speech. |
| wordsPerSecond | Descriptive statistics for the words-per-second distribution. |
| lettersPerSecond | Descriptive statistics for the letters-per-second distribution. |
| wordsPerUtterance | Descriptive statistics for the words-per-utterance distribution. |
| lettersPerUtterance | Descriptive statistics for the letters-per-utterance distribution. |
| utteranceCount | string (int64) Number of utterances. |
| utteranceDurationEstimation | Descriptive statistics for the utterance duration distribution. |
AudioSegmentBoundaries
| Field | Description |
| --- | --- |
| startTimeMs | string (int64) Audio segment start time. |
| endTimeMs | string (int64) Audio segment end time. |
DescriptiveStatistics
| Field | Description |
| --- | --- |
| min | string Minimum observed value. |
| max | string Maximum observed value. |
| mean | string Estimated mean of distribution. |
| std | string Estimated standard deviation of distribution. |
| quantiles[] | List of evaluated quantiles. |
Quantile
| Field | Description |
| --- | --- |
| level | string Quantile level in range (0, 1). |
| value | string Quantile value. |
ConversationAnalysis
| Field | Description |
| --- | --- |
| conversationBoundaries | Audio segment boundaries. |
| totalSimultaneousSilenceDurationMs | string (int64) Total simultaneous silence duration. |
| totalSimultaneousSilenceRatio | string Simultaneous silence ratio within audio segment. |
| simultaneousSilenceDurationEstimation | Descriptive statistics for simultaneous silence duration distribution. |
| totalSimultaneousSpeechDurationMs | string (int64) Total simultaneous speech duration. |
| totalSimultaneousSpeechRatio | string Simultaneous speech ratio within audio segment. |
| simultaneousSpeechDurationEstimation | Descriptive statistics for simultaneous speech duration distribution. |
| speakerInterrupts[] | Interrupts description for every speaker. |
| totalSpeechDurationMs | string (int64) Total speech duration, including both simultaneous and separate speech. |
| totalSpeechRatio | string Total speech ratio within audio segment. |
InterruptsEvaluation
| Field | Description |
| --- | --- |
| speakerTag | string Speaker tag. |
| interruptsCount | string (int64) Number of interrupts made by the speaker. |
| interruptsDurationMs | string (int64) Total duration of all interrupts. |
| interrupts[] | Boundaries for every interrupt. |
Summarization
| Field | Description |
| --- | --- |
| results[] | A list of summarizations of the transcription. |
| contentUsage | A set of statistics describing the number of content tokens used by the completion model. |
SummarizationPropertyResult
Represents summarization response entry for transcription.
| Field | Description |
| --- | --- |
| response | string Summarization response text. |
ContentUsage
An object representing the number of content tokens used by the completion model.
| Field | Description |
| --- | --- |
| inputTextTokens | string (int64) The number of tokens in the textual part of the model input. |
| completionTokens | string (int64) The number of tokens in the generated completion. |
| totalTokens | string (int64) The total number of tokens, including all input tokens and all generated tokens. |
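Since `totalTokens` covers all input and generated tokens, it should equal the sum of the other two fields. A sketch of that accounting check (the sample numbers are illustrative):

```python
def token_totals(content_usage: dict) -> int:
    """Sum of input and completion tokens; expected to equal totalTokens.

    int64 fields are JSON strings, hence the int() conversions.
    """
    return int(content_usage["inputTextTokens"]) + int(content_usage["completionTokens"])

usage = {"inputTextTokens": "800", "completionTokens": "120", "totalTokens": "920"}
print(token_totals(usage))  # → 920
```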