SpeechKit Recognition API v3, REST: AsyncRecognizer.getRecognition
HTTP request
GET https://stt.api.cloud.yandex.net/stt/v3/getRecognition
Query parameters
Parameter | Description |
---|---|
operationId | Required. ID of the recognition operation. The maximum string length is 50 characters. |
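A minimal request sketch in Python. The IAM token and operation ID below are placeholders, and `Authorization: Bearer <IAM token>` authentication is assumed:

```python
import json
import urllib.request

STT_ENDPOINT = "https://stt.api.cloud.yandex.net/stt/v3/getRecognition"

def build_request(operation_id: str, iam_token: str) -> urllib.request.Request:
    """Build the GET request for AsyncRecognizer.getRecognition."""
    if len(operation_id) > 50:  # documented limit for operationId
        raise ValueError("operationId must be at most 50 characters")
    url = f"{STT_ENDPOINT}?operationId={operation_id}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {iam_token}"}
    )

# Usage (requires a valid IAM token and a completed recognition operation):
# with urllib.request.urlopen(build_request("<operation-id>", "<iam-token>")) as resp:
#     for line in resp:            # one JSON response object per line (assumed)
#         print(json.loads(line))
```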
Response
HTTP Code: 200 - OK
{
"sessionUuid": {
"uuid": "string",
"userRequestId": "string"
},
"audioCursors": {
"receivedDataMs": "string",
"resetTimeMs": "string",
"partialTimeMs": "string",
"finalTimeMs": "string",
"finalIndex": "string",
"eouTimeMs": "string"
},
"responseWallTimeMs": "string",
"channelTag": "string",
// includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "number",
"languages": [
{
"languageCode": "string",
"probability": "number"
}
]
}
],
"channelTag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "number",
"languages": [
{
"languageCode": "string",
"probability": "number"
}
]
}
],
"channelTag": "string"
},
"eouUpdate": {
"timeMs": "string"
},
"finalRefinement": {
"finalIndex": "string",
"normalizedText": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"confidence": "number",
"languages": [
{
"languageCode": "string",
"probability": "number"
}
]
}
],
"channelTag": "string"
}
},
"statusCode": {
"codeType": "string",
"message": "string"
},
"classifierUpdate": {
"windowType": "string",
"startTimeMs": "string",
"endTimeMs": "string",
"classifierResult": {
"classifier": "string",
"highlights": [
{
"text": "string",
"startTimeMs": "string",
"endTimeMs": "string"
}
],
"labels": [
{
"label": "string",
"confidence": "number"
}
]
}
},
"speakerAnalysis": {
"speakerTag": "string",
"windowType": "string",
"speechBoundaries": {
"startTimeMs": "string",
"endTimeMs": "string"
},
"totalSpeechMs": "string",
"speechRatio": "number",
"totalSilenceMs": "string",
"silenceRatio": "number",
"wordsCount": "string",
"lettersCount": "string",
"wordsPerSecond": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
},
"lettersPerSecond": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
},
"wordsPerUtterance": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
},
"lettersPerUtterance": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
},
"utteranceCount": "string",
"utteranceDurationEstimation": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
}
},
"conversationAnalysis": {
"conversationBoundaries": {
"startTimeMs": "string",
"endTimeMs": "string"
},
"totalSimultaneousSilenceDurationMs": "string",
"totalSimultaneousSilenceRatio": "number",
"simultaneousSilenceDurationEstimation": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
},
"totalSimultaneousSpeechDurationMs": "string",
"totalSimultaneousSpeechRatio": "number",
"simultaneousSpeechDurationEstimation": {
"min": "number",
"max": "number",
"mean": "number",
"std": "number",
"quantiles": [
{
"level": "number",
"value": "number"
}
]
},
"speakerInterrupts": [
{
"speakerTag": "string",
"interruptsCount": "string",
"interruptsDurationMs": "string",
"interrupts": [
{
"startTimeMs": "string",
"endTimeMs": "string"
}
]
}
],
"totalSpeechDurationMs": "string",
"totalSpeechRatio": "number"
},
// end of the list of possible fields
}
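Since each response object carries exactly one of the mutually exclusive event fields listed in the schema, a client can dispatch on whichever key is present. A minimal sketch (field names taken from the schema above):

```python
# The mutually exclusive event fields a response object may contain.
EVENT_FIELDS = (
    "partial", "final", "eouUpdate", "finalRefinement",
    "statusCode", "classifierUpdate", "speakerAnalysis", "conversationAnalysis",
)

def dispatch(response: dict) -> tuple:
    """Return (event_name, payload) for the single event present in a response."""
    for name in EVENT_FIELDS:
        if name in response:
            return name, response[name]
    raise ValueError("no recognized event field in response")

# Example:
# name, payload = dispatch(event)
# if name == "final":
#     print(payload["alternatives"][0]["text"])
```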
Responses from the server. Each response contains the session UUID and audio cursors, plus one specific event.
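`final` events fix the transcript up to final_time_ms, and `finalRefinement` events may later replace a final's text with its normalized form. A sketch of assembling a transcript from the stream, assuming each response carries `audioCursors.finalIndex` at the moment its final is emitted and that a refinement's `finalIndex` matches the final it refines:

```python
def assemble_transcript(responses) -> str:
    """Collect final texts, letting later finalRefinement events override them."""
    texts = {}  # finalIndex -> best available text
    for r in responses:
        if "final" in r:
            idx = int(r["audioCursors"]["finalIndex"])  # int64 fields arrive as strings
            alts = r["final"].get("alternatives", [])
            if alts:
                texts[idx] = alts[0]["text"]            # take the top hypothesis
        elif "finalRefinement" in r:
            ref = r["finalRefinement"]
            alts = ref["normalizedText"].get("alternatives", [])
            if alts:
                texts[int(ref["finalIndex"])] = alts[0]["text"]
    return " ".join(texts[i] for i in sorted(texts))
```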
Field | Description |
---|---|
sessionUuid | object Session identifier |
sessionUuid. uuid |
string Internal session identifier. |
sessionUuid. userRequestId |
string User session identifier. |
audioCursors | object Progress of the streaming recognition session: how much data has been received, final and partial times, etc. |
audioCursors. receivedDataMs |
string (int64) Amount of audio received by the server, in milliseconds. This cursor advances after each audio chunk is received. |
audioCursors. resetTimeMs |
string (int64) Input stream reset data. |
audioCursors. partialTimeMs |
string (int64) How much audio has been processed, including trimmed silences. This cursor advances after the server has received enough data to update the recognition results. |
audioCursors. finalTimeMs |
string (int64) Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to final_time_ms will no longer change; usually this event is followed by EOU detection (but this could change in the future). |
audioCursors. finalIndex |
string (int64) Index of the last final the server sent; incremented after each new final. |
audioCursors. eouTimeMs |
string (int64) Estimated time of EOU. This cursor is updated after each new EOU is sent. For an external classifier, this equals received_data_ms at the moment the EOU event arrives; for the internal classifier, it is a time estimate. The time is not exact and has the same guarantees as word timings. |
responseWallTimeMs | string (int64) Server-side wall-clock time at which the server wrote the results to the stream. |
channelTag | string Tag used to distinguish audio channels. |
partial | object Partial results; the server sends them regularly once enough audio data has been received from the user. This is the current text estimate for the interval from final_time_ms to partial_time_ms and may change as new data arrives. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
partial. alternatives[] |
object List of hypotheses for time frames. |
partial. alternatives[]. words[] |
object Words in time frame. |
partial. alternatives[]. words[]. text |
string Word text. |
partial. alternatives[]. words[]. startTimeMs |
string (int64) Estimation of word start time in ms. |
partial. alternatives[]. words[]. endTimeMs |
string (int64) Estimation of word end time in ms. |
partial. alternatives[]. text |
string Text in time frame. |
partial. alternatives[]. startTimeMs |
string (int64) Start of time frame. |
partial. alternatives[]. endTimeMs |
string (int64) End of time frame. |
partial. alternatives[]. confidence |
number (double) The hypothesis confidence. Currently not used. |
partial. alternatives[]. languages[] |
object Distribution over possible languages. |
partial. alternatives[]. languages[]. languageCode |
string Language code in ISO 639-1 format. |
partial. alternatives[]. languages[]. probability |
number (double) Estimation of language probability. |
partial. channelTag |
string |
final | object Final results; the recognition is now fixed up to final_time_ms. For now, a final is sent only if the EOU event was triggered; this could change in future releases. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
final. alternatives[] |
object List of hypotheses for time frames. |
final. alternatives[]. words[] |
object Words in time frame. |
final. alternatives[]. words[]. text |
string Word text. |
final. alternatives[]. words[]. startTimeMs |
string (int64) Estimation of word start time in ms. |
final. alternatives[]. words[]. endTimeMs |
string (int64) Estimation of word end time in ms. |
final. alternatives[]. text |
string Text in time frame. |
final. alternatives[]. startTimeMs |
string (int64) Start of time frame. |
final. alternatives[]. endTimeMs |
string (int64) End of time frame. |
final. alternatives[]. confidence |
number (double) The hypothesis confidence. Currently not used. |
final. alternatives[]. languages[] |
object Distribution over possible languages. |
final. alternatives[]. languages[]. languageCode |
string Language code in ISO 639-1 format. |
final. alternatives[]. languages[]. probability |
number (double) Estimation of language probability. |
final. channelTag |
string |
eouUpdate | object After the EOU classifier triggers, the server sends a final, then an EouUpdate with the time of the EOU. Before the eou_update, a final with the same time is sent; there may be several finals before an EOU update. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
eouUpdate. timeMs |
string (int64) EOU estimated time. |
finalRefinement | object For each final, if normalization is enabled, the server sends the normalized text (or the result of some other advanced post-processing). Final normalization introduces additional latency. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
finalRefinement. finalIndex |
string (int64) Index of final for which server sends additional information. |
finalRefinement. normalizedText |
object Normalized text (instead of the raw text). |
finalRefinement. normalizedText. alternatives[] |
object List of hypotheses for time frames. |
finalRefinement. normalizedText. alternatives[]. words[] |
object Words in time frame. |
finalRefinement. normalizedText. alternatives[]. words[]. text |
string Word text. |
finalRefinement. normalizedText. alternatives[]. words[]. startTimeMs |
string (int64) Estimation of word start time in ms. |
finalRefinement. normalizedText. alternatives[]. words[]. endTimeMs |
string (int64) Estimation of word end time in ms. |
finalRefinement. normalizedText. alternatives[]. text |
string Text in time frame. |
finalRefinement. normalizedText. alternatives[]. startTimeMs |
string (int64) Start of time frame. |
finalRefinement. normalizedText. alternatives[]. endTimeMs |
string (int64) End of time frame. |
finalRefinement. normalizedText. alternatives[]. confidence |
number (double) The hypothesis confidence. Currently not used. |
finalRefinement. normalizedText. alternatives[]. languages[] |
object Distribution over possible languages. |
finalRefinement. normalizedText. alternatives[]. languages[]. languageCode |
string Language code in ISO 639-1 format. |
finalRefinement. normalizedText. alternatives[]. languages[]. probability |
number (double) Estimation of language probability. |
finalRefinement. normalizedText. channelTag |
string |
statusCode | object Status messages, sent by the server at a fixed interval (keep-alive). Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
statusCode. codeType |
string Code type. |
statusCode. message |
string Human readable message. |
classifierUpdate | object Result of the triggered classifier. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
classifierUpdate. windowType |
string Response window type. |
classifierUpdate. startTimeMs |
string (int64) Start time of the audio segment used for classification |
classifierUpdate. endTimeMs |
string (int64) End time of the audio segment used for classification |
classifierUpdate. classifierResult |
object Result for a dictionary-based classifier. |
classifierUpdate. classifierResult. classifier |
string Name of the triggered classifier |
classifierUpdate. classifierResult. highlights[] |
object List of highlights, i.e., parts of the phrase that determine the classification result. |
classifierUpdate. classifierResult. highlights[]. text |
string Text transcription of the highlighted audio segment |
classifierUpdate. classifierResult. highlights[]. startTimeMs |
string (int64) Start time of the highlighted audio segment |
classifierUpdate. classifierResult. highlights[]. endTimeMs |
string (int64) End time of the highlighted audio segment |
classifierUpdate. classifierResult. labels[] |
object Classifier predictions |
classifierUpdate. classifierResult. labels[]. label |
string The label of the class predicted by the classifier |
classifierUpdate. classifierResult. labels[]. confidence |
number (double) The prediction confidence |
speakerAnalysis | object Speech statistics for every speaker. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
speakerAnalysis. speakerTag |
string Speaker tag |
speakerAnalysis. windowType |
string Response window type. |
speakerAnalysis. speechBoundaries |
object Audio segment boundaries |
speakerAnalysis. speechBoundaries. startTimeMs |
string (int64) Audio segment start time |
speakerAnalysis. speechBoundaries. endTimeMs |
string (int64) Audio segment end time |
speakerAnalysis. totalSpeechMs |
string (int64) Total speech duration |
speakerAnalysis. speechRatio |
number (double) Speech ratio within audio segment |
speakerAnalysis. totalSilenceMs |
string (int64) Total silence duration |
speakerAnalysis. silenceRatio |
number (double) Silence ratio within audio segment |
speakerAnalysis. wordsCount |
string (int64) Number of words in recognized speech |
speakerAnalysis. lettersCount |
string (int64) Number of letters in recognized speech |
speakerAnalysis. wordsPerSecond |
object Descriptive statistics for words per second distribution |
speakerAnalysis. wordsPerSecond. min |
number (double) Minimum observed value |
speakerAnalysis. wordsPerSecond. max |
number (double) Maximum observed value |
speakerAnalysis. wordsPerSecond. mean |
number (double) Estimated mean of distribution |
speakerAnalysis. wordsPerSecond. std |
number (double) Estimated standard deviation of distribution |
speakerAnalysis. wordsPerSecond. quantiles[] |
object List of evaluated quantiles |
speakerAnalysis. wordsPerSecond. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
speakerAnalysis. wordsPerSecond. quantiles[]. value |
number (double) Quantile value |
speakerAnalysis. lettersPerSecond |
object Descriptive statistics for letters per second distribution |
speakerAnalysis. lettersPerSecond. min |
number (double) Minimum observed value |
speakerAnalysis. lettersPerSecond. max |
number (double) Maximum observed value |
speakerAnalysis. lettersPerSecond. mean |
number (double) Estimated mean of distribution |
speakerAnalysis. lettersPerSecond. std |
number (double) Estimated standard deviation of distribution |
speakerAnalysis. lettersPerSecond. quantiles[] |
object List of evaluated quantiles |
speakerAnalysis. lettersPerSecond. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
speakerAnalysis. lettersPerSecond. quantiles[]. value |
number (double) Quantile value |
speakerAnalysis. wordsPerUtterance |
object Descriptive statistics for words per utterance distribution |
speakerAnalysis. wordsPerUtterance. min |
number (double) Minimum observed value |
speakerAnalysis. wordsPerUtterance. max |
number (double) Maximum observed value |
speakerAnalysis. wordsPerUtterance. mean |
number (double) Estimated mean of distribution |
speakerAnalysis. wordsPerUtterance. std |
number (double) Estimated standard deviation of distribution |
speakerAnalysis. wordsPerUtterance. quantiles[] |
object List of evaluated quantiles |
speakerAnalysis. wordsPerUtterance. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
speakerAnalysis. wordsPerUtterance. quantiles[]. value |
number (double) Quantile value |
speakerAnalysis. lettersPerUtterance |
object Descriptive statistics for letters per utterance distribution |
speakerAnalysis. lettersPerUtterance. min |
number (double) Minimum observed value |
speakerAnalysis. lettersPerUtterance. max |
number (double) Maximum observed value |
speakerAnalysis. lettersPerUtterance. mean |
number (double) Estimated mean of distribution |
speakerAnalysis. lettersPerUtterance. std |
number (double) Estimated standard deviation of distribution |
speakerAnalysis. lettersPerUtterance. quantiles[] |
object List of evaluated quantiles |
speakerAnalysis. lettersPerUtterance. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
speakerAnalysis. lettersPerUtterance. quantiles[]. value |
number (double) Quantile value |
speakerAnalysis. utteranceCount |
string (int64) Number of utterances |
speakerAnalysis. utteranceDurationEstimation |
object Descriptive statistics for utterance duration distribution |
speakerAnalysis. utteranceDurationEstimation. min |
number (double) Minimum observed value |
speakerAnalysis. utteranceDurationEstimation. max |
number (double) Maximum observed value |
speakerAnalysis. utteranceDurationEstimation. mean |
number (double) Estimated mean of distribution |
speakerAnalysis. utteranceDurationEstimation. std |
number (double) Estimated standard deviation of distribution |
speakerAnalysis. utteranceDurationEstimation. quantiles[] |
object List of evaluated quantiles |
speakerAnalysis. utteranceDurationEstimation. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
speakerAnalysis. utteranceDurationEstimation. quantiles[]. value |
number (double) Quantile value |
conversationAnalysis | object Conversation statistics. Includes only one of the fields partial, final, eouUpdate, finalRefinement, statusCode, classifierUpdate, speakerAnalysis, conversationAnalysis. |
conversationAnalysis. conversationBoundaries |
object Audio segment boundaries |
conversationAnalysis. conversationBoundaries. startTimeMs |
string (int64) Audio segment start time |
conversationAnalysis. conversationBoundaries. endTimeMs |
string (int64) Audio segment end time |
conversationAnalysis. totalSimultaneousSilenceDurationMs |
string (int64) Total simultaneous silence duration |
conversationAnalysis. totalSimultaneousSilenceRatio |
number (double) Simultaneous silence ratio within audio segment |
conversationAnalysis. simultaneousSilenceDurationEstimation |
object Descriptive statistics for simultaneous silence duration distribution |
conversationAnalysis. simultaneousSilenceDurationEstimation. min |
number (double) Minimum observed value |
conversationAnalysis. simultaneousSilenceDurationEstimation. max |
number (double) Maximum observed value |
conversationAnalysis. simultaneousSilenceDurationEstimation. mean |
number (double) Estimated mean of distribution |
conversationAnalysis. simultaneousSilenceDurationEstimation. std |
number (double) Estimated standard deviation of distribution |
conversationAnalysis. simultaneousSilenceDurationEstimation. quantiles[] |
object List of evaluated quantiles |
conversationAnalysis. simultaneousSilenceDurationEstimation. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
conversationAnalysis. simultaneousSilenceDurationEstimation. quantiles[]. value |
number (double) Quantile value |
conversationAnalysis. totalSimultaneousSpeechDurationMs |
string (int64) Total simultaneous speech duration |
conversationAnalysis. totalSimultaneousSpeechRatio |
number (double) Simultaneous speech ratio within audio segment |
conversationAnalysis. simultaneousSpeechDurationEstimation |
object Descriptive statistics for simultaneous speech duration distribution |
conversationAnalysis. simultaneousSpeechDurationEstimation. min |
number (double) Minimum observed value |
conversationAnalysis. simultaneousSpeechDurationEstimation. max |
number (double) Maximum observed value |
conversationAnalysis. simultaneousSpeechDurationEstimation. mean |
number (double) Estimated mean of distribution |
conversationAnalysis. simultaneousSpeechDurationEstimation. std |
number (double) Estimated standard deviation of distribution |
conversationAnalysis. simultaneousSpeechDurationEstimation. quantiles[] |
object List of evaluated quantiles |
conversationAnalysis. simultaneousSpeechDurationEstimation. quantiles[]. level |
number (double) Quantile level in range (0, 1) |
conversationAnalysis. simultaneousSpeechDurationEstimation. quantiles[]. value |
number (double) Quantile value |
conversationAnalysis. speakerInterrupts[] |
object Description of interrupts for every speaker. |
conversationAnalysis. speakerInterrupts[]. speakerTag |
string Speaker tag |
conversationAnalysis. speakerInterrupts[]. interruptsCount |
string (int64) Number of interrupts made by the speaker |
conversationAnalysis. speakerInterrupts[]. interruptsDurationMs |
string (int64) Total duration of all interrupts |
conversationAnalysis. speakerInterrupts[]. interrupts[] |
object Boundaries for every interrupt |
conversationAnalysis. speakerInterrupts[]. interrupts[]. startTimeMs |
string (int64) Audio segment start time |
conversationAnalysis. speakerInterrupts[]. interrupts[]. endTimeMs |
string (int64) Audio segment end time |
conversationAnalysis. totalSpeechDurationMs |
string (int64) Total speech duration, including both simultaneous and separate speech |
conversationAnalysis. totalSpeechRatio |
number (double) Total speech ratio within audio segment |
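The descriptive-statistics objects in speakerAnalysis and conversationAnalysis (wordsPerSecond, utteranceDurationEstimation, simultaneousSpeechDurationEstimation, etc.) share one shape: min, max, mean, std, plus a list of evaluated quantiles. A small sketch of a helper for reading one quantile, e.g. the median words per second:

```python
def quantile(stats: dict, level: float) -> float:
    """Look up the value for a given quantile level in a descriptive-statistics object."""
    for q in stats.get("quantiles", []):
        if abs(q["level"] - level) < 1e-9:  # levels are doubles in (0, 1)
            return q["value"]
    raise KeyError(f"quantile level {level} not reported by the server")

# Example:
# median_wps = quantile(speaker_analysis["wordsPerSecond"], 0.5)
```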