SpeechKit Recognition API v3, gRPC: AsyncRecognizer.GetRecognition
- gRPC request
- GetRecognitionRequest
- StreamingResponse
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
gRPC request
rpc GetRecognition (GetRecognitionRequest) returns (stream StreamingResponse)
GetRecognitionRequest
{
"operation_id": "string"
}
| Field | Description |
|---|---|
| operation_id | string |
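
A minimal client sketch in Python, assuming stubs generated from the public `yandex/cloud/ai/stt/v3` protos (shipped in the `yandexcloud` pip package), the `stt.api.cloud.yandex.net:443` endpoint, and an IAM token in the `IAM_TOKEN` environment variable:

```python
import os

import grpc
from yandex.cloud.ai.stt.v3 import stt_service_pb2, stt_service_pb2_grpc


def get_recognition(operation_id: str):
    """Yield StreamingResponse messages for a finished recognition operation."""
    creds = grpc.ssl_channel_credentials()
    with grpc.secure_channel("stt.api.cloud.yandex.net:443", creds) as channel:
        stub = stt_service_pb2_grpc.AsyncRecognizerStub(channel)
        request = stt_service_pb2.GetRecognitionRequest(operation_id=operation_id)
        # GetRecognition is server-streaming: iterate over the responses.
        for response in stub.GetRecognition(
            request,
            metadata=(("authorization", f"Bearer {os.environ['IAM_TOKEN']}"),),
        ):
            yield response
```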
StreamingResponse
{
"session_uuid": {
"uuid": "string",
"user_request_id": "string"
},
"audio_cursors": {
"received_data_ms": "int64",
"reset_time_ms": "int64",
"partial_time_ms": "int64",
"final_time_ms": "int64",
"final_index": "int64",
"eou_time_ms": "int64"
},
"response_wall_time_ms": "int64",
// Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64",
"confidence": "double",
"languages": [
{
"language_code": "string",
"probability": "double"
}
]
}
],
"channel_tag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64",
"confidence": "double",
"languages": [
{
"language_code": "string",
"probability": "double"
}
]
}
],
"channel_tag": "string"
},
"eou_update": {
"time_ms": "int64"
},
"final_refinement": {
"final_index": "int64",
// Includes only one of the fields `normalized_text`
"normalized_text": {
"alternatives": [
{
"words": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64",
"confidence": "double",
"languages": [
{
"language_code": "string",
"probability": "double"
}
]
}
],
"channel_tag": "string"
}
// end of the list of possible fields
},
"status_code": {
"code_type": "CodeType",
"message": "string"
},
"classifier_update": {
"window_type": "WindowType",
"start_time_ms": "int64",
"end_time_ms": "int64",
"classifier_result": {
"classifier": "string",
"highlights": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"labels": [
{
"label": "string",
"confidence": "double"
}
]
}
},
"speaker_analysis": {
"speaker_tag": "string",
"window_type": "WindowType",
"speech_boundaries": {
"start_time_ms": "int64",
"end_time_ms": "int64"
},
"total_speech_ms": "int64",
"speech_ratio": "double",
"total_silence_ms": "int64",
"silence_ratio": "double",
"words_count": "int64",
"letters_count": "int64",
"words_per_second": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"letters_per_second": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"words_per_utterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"letters_per_utterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"utterance_count": "int64",
"utterance_duration_estimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
}
},
"conversation_analysis": {
"conversation_boundaries": {
"start_time_ms": "int64",
"end_time_ms": "int64"
},
"total_simultaneous_silence_duration_ms": "int64",
"total_simultaneous_silence_ratio": "double",
"simultaneous_silence_duration_estimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"total_simultaneous_speech_duration_ms": "int64",
"total_simultaneous_speech_ratio": "double",
"simultaneous_speech_duration_estimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"speaker_interrupts": [
{
"speaker_tag": "string",
"interrupts_count": "int64",
"interrupts_duration_ms": "int64",
"interrupts": [
{
"start_time_ms": "int64",
"end_time_ms": "int64"
}
]
}
],
"total_speech_duration_ms": "int64",
"total_speech_ratio": "double"
},
// end of the list of possible fields
"channel_tag": "string"
}
Responses from the server. Each response contains the session UUID, the audio cursors, and one specific event. Only one of the event fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, and `conversation_analysis` is set in each response.

| Field | Description |
|---|---|
| session_uuid | SessionUuid. Session identifier. |
| audio_cursors | AudioCursors. Progress of the streaming recognition session: how much data the server has received, partial and final times, etc. |
| response_wall_time_ms | int64. Wall-clock time on the server side: the time when the server wrote the results to the stream. |
| partial | AlternativeUpdate. Partial results. The server sends them regularly once enough audio data has been received from the user; they are the current text estimate. |
| final | AlternativeUpdate. Final results: the recognition is now fixed up to final_time_ms. For now, a final is sent only if an EOU event was triggered; this may change in future releases. |
| eou_update | EouUpdate. After the EOU classifier triggers, the server sends the final message and then an EouUpdate with the time of the EOU. |
| final_refinement | FinalRefinement. For each final, if normalization (or some other advanced post-processing) is enabled, the server sends the normalized text. |
| status_code | StatusCode. Status messages sent by the server at a fixed interval (keep-alive). |
| classifier_update | RecognitionClassifierUpdate. Result of a triggered classifier. |
| speaker_analysis | SpeakerAnalysis. Speech statistics for each speaker. |
| conversation_analysis | ConversationAnalysis. Conversation statistics. |
| channel_tag | string. Tag to distinguish audio channels. |
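
In client code, the event oneof is typically handled by dispatching on which field is set. A sketch in Python, assuming the oneof is named `Event` as in the published `stt.proto`:

```python
def handle(response) -> None:
    """Dispatch on the event carried by a StreamingResponse (sketch)."""
    event = response.WhichOneof("Event")  # assumes the oneof is named "Event"
    if event == "partial" and response.partial.alternatives:
        # Current, still-changing hypothesis for the unfinalized audio.
        print("partial:", response.partial.alternatives[0].text)
    elif event == "final" and response.final.alternatives:
        # Recognition is fixed up to audio_cursors.final_time_ms.
        print("final:", response.final.alternatives[0].text)
    elif event == "final_refinement":
        # Post-processed (e.g. normalized) text for an earlier final.
        refinement = response.final_refinement
        text = refinement.normalized_text.alternatives[0].text
        print(f"refinement of final #{refinement.final_index}:", text)
    elif event == "status_code":
        # Periodic keep-alive / status message from the server.
        print("status:", response.status_code.message)
```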
SessionUuid
Session identifier.
| Field | Description |
|---|---|
| uuid | string. Internal session identifier. |
| user_request_id | string. User session identifier. |
AudioCursors
AudioCursors represent the state of the ASR recognition stream.

| Field | Description |
|---|---|
| received_data_ms | int64. Amount of audio data the server has received. This cursor advances after each audio chunk the server receives. |
| reset_time_ms | int64. Input stream reset data. |
| partial_time_ms | int64. How much audio has been processed. This time includes trimmed silences. This cursor advances once the server has received enough data. |
| final_time_ms | int64. Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to final_time_ms will no longer change. |
| final_index | int64. Index of the last final the server has sent. Incremented after each new final. |
| eou_time_ms | int64. Estimated time of the EOU. This cursor is updated after each new EOU is sent. |
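
For example, a client can use the cursors to report progress and to see how much received audio is not yet covered by a final (a sketch; field names as in the table above):

```python
def report_progress(cursors) -> None:
    """Print a one-line progress summary from an AudioCursors message."""
    pending_ms = cursors.received_data_ms - cursors.final_time_ms
    print(f"received {cursors.received_data_ms} ms, "
          f"partials up to {cursors.partial_time_ms} ms, "
          f"finalized up to {cursors.final_time_ms} ms "
          f"(final #{cursors.final_index}, {pending_ms} ms not yet final)")
```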
AlternativeUpdate
Hypothesis update.

| Field | Description |
|---|---|
| alternatives[] | Alternative. List of hypotheses for the time frame. |
| channel_tag | string |
Alternative
Recognition of a specific time frame.

| Field | Description |
|---|---|
| words[] | Word. Words in the time frame. |
| text | string. Text in the time frame. |
| start_time_ms | int64. Start of the time frame. |
| end_time_ms | int64. End of the time frame. |
| confidence | double. Hypothesis confidence. Currently not used. |
| languages[] | LanguageEstimation. Distribution over possible languages. |
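
Since `languages` is a probability distribution, a client can, for instance, pick the dominant language of an alternative (a sketch):

```python
def dominant_language(alternative) -> str | None:
    """Return the most probable language code of an Alternative, if any."""
    if not alternative.languages:
        return None
    best = max(alternative.languages, key=lambda lang: lang.probability)
    return best.language_code
```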
Word
Recognized word.
| Field | Description |
|---|---|
| text | string. Word text. |
| start_time_ms | int64. Estimated word start time, in ms. |
| end_time_ms | int64. Estimated word end time, in ms. |
LanguageEstimation
Estimation of language and its probability.
| Field | Description |
|---|---|
| language_code | string. Language code in ISO 639-1 format. |
| probability | double. Estimated language probability. |
EouUpdate
Update information for external End of Utterance.
| Field | Description |
|---|---|
| time_ms | int64. Estimated EOU time. |
FinalRefinement
Refinement of a final hypothesis. For example, text normalization is a refinement.

| Field | Description |
|---|---|
| final_index | int64. Index of the final for which the server sends additional information. |
| normalized_text | AlternativeUpdate. Normalized text instead of the raw one. Includes only one of the fields `normalized_text` (the type of refinement). |
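
Because refinements arrive as separate events, a client usually matches them to earlier finals via `final_index`; a sketch:

```python
# Sketch: keep finals by index and overwrite them with normalized text
# when the matching final_refinement arrives.
finals: dict[int, str] = {}

def on_final(final_index: int, update) -> None:
    # final_index comes from audio_cursors.final_index of the same response.
    if update.alternatives:
        finals[final_index] = update.alternatives[0].text

def on_refinement(refinement) -> None:
    if refinement.HasField("normalized_text"):
        alts = refinement.normalized_text.alternatives
        if alts:
            finals[refinement.final_index] = alts[0].text
```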
StatusCode
Status message.

| Field | Description |
|---|---|
| code_type | enum CodeType. Code type. |
| message | string. Human-readable message. |
RecognitionClassifierUpdate
| Field | Description |
|---|---|
| window_type | enum WindowType. Response window type. |
| start_time_ms | int64. Start time of the audio segment used for classification. |
| end_time_ms | int64. End time of the audio segment used for classification. |
| classifier_result | RecognitionClassifierResult. Result of the dictionary-based classifier. |
RecognitionClassifierResult
| Field | Description |
|---|---|
| classifier | string. Name of the triggered classifier. |
| highlights[] | PhraseHighlight. List of highlights, i.e. the parts of the phrase that determine the classification result. |
| labels[] | RecognitionClassifierLabel. Classifier predictions. |
PhraseHighlight
| Field | Description |
|---|---|
| text | string. Text transcription of the highlighted audio segment. |
| start_time_ms | int64. Start time of the highlighted audio segment. |
| end_time_ms | int64. End time of the highlighted audio segment. |
RecognitionClassifierLabel
| Field | Description |
|---|---|
| label | string. Label of the class predicted by the classifier. |
| confidence | double. Prediction confidence. |
SpeakerAnalysis
| Field | Description |
|---|---|
| speaker_tag | string. Speaker tag. |
| window_type | enum WindowType. Response window type. |
| speech_boundaries | AudioSegmentBoundaries. Audio segment boundaries. |
| total_speech_ms | int64. Total speech duration. |
| speech_ratio | double. Speech ratio within the audio segment. |
| total_silence_ms | int64. Total silence duration. |
| silence_ratio | double. Silence ratio within the audio segment. |
| words_count | int64. Number of words in the recognized speech. |
| letters_count | int64. Number of letters in the recognized speech. |
| words_per_second | DescriptiveStatistics. Descriptive statistics of the words-per-second distribution. |
| letters_per_second | DescriptiveStatistics. Descriptive statistics of the letters-per-second distribution. |
| words_per_utterance | DescriptiveStatistics. Descriptive statistics of the words-per-utterance distribution. |
| letters_per_utterance | DescriptiveStatistics. Descriptive statistics of the letters-per-utterance distribution. |
| utterance_count | int64. Number of utterances. |
| utterance_duration_estimation | DescriptiveStatistics. Descriptive statistics of the utterance duration distribution. |
AudioSegmentBoundaries
| Field | Description |
|---|---|
| start_time_ms | int64. Audio segment start time. |
| end_time_ms | int64. Audio segment end time. |
DescriptiveStatistics
| Field | Description |
|---|---|
| min | double. Minimum observed value. |
| max | double. Maximum observed value. |
| mean | double. Estimated mean of the distribution. |
| std | double. Estimated standard deviation of the distribution. |
| quantiles[] | Quantile. List of evaluated quantiles. |
Quantile
| Field | Description |
|---|---|
| level | double. Quantile level in the range (0, 1). |
| value | double. Quantile value. |
ConversationAnalysis
| Field | Description |
|---|---|
| conversation_boundaries | AudioSegmentBoundaries. Audio segment boundaries. |
| total_simultaneous_silence_duration_ms | int64. Total duration of simultaneous silence. |
| total_simultaneous_silence_ratio | double. Simultaneous silence ratio within the audio segment. |
| simultaneous_silence_duration_estimation | DescriptiveStatistics. Descriptive statistics of the simultaneous silence duration distribution. |
| total_simultaneous_speech_duration_ms | int64. Total duration of simultaneous speech. |
| total_simultaneous_speech_ratio | double. Simultaneous speech ratio within the audio segment. |
| simultaneous_speech_duration_estimation | DescriptiveStatistics. Descriptive statistics of the simultaneous speech duration distribution. |
| speaker_interrupts[] | InterruptsEvaluation. Interrupts description for every speaker. |
| total_speech_duration_ms | int64. Total speech duration, including both simultaneous and separate speech. |
| total_speech_ratio | double. Total speech ratio within the audio segment. |
InterruptsEvaluation
| Field | Description |
|---|---|
| speaker_tag | string. Speaker tag. |
| interrupts_count | int64. Number of interrupts made by the speaker. |
| interrupts_duration_ms | int64. Total duration of all interrupts. |
| interrupts[] | AudioSegmentBoundaries. Boundaries for every interrupt. |
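
For example, to summarize the per-speaker interrupts from a `conversation_analysis` event (a sketch):

```python
def summarize_interrupts(conversation) -> None:
    """Print interrupt counts and durations per speaker (sketch)."""
    for speaker in conversation.speaker_interrupts:
        print(f"speaker {speaker.speaker_tag}: "
              f"{speaker.interrupts_count} interrupts, "
              f"{speaker.interrupts_duration_ms} ms in total")
```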