SpeechKit Recognition API v3, gRPC: Recognizer.RecognizeStreaming
- gRPC request
- StreamingRequest
- StreamingOptions
- RecognitionModelOptions
- AudioFormatOptions
- RawAudio
- ContainerAudio
- TextNormalizationOptions
- LanguageRestrictionOptions
- EouClassifierOptions
- DefaultEouClassifier
- ExternalEouClassifier
- RecognitionClassifierOptions
- RecognitionClassifier
- SpeechAnalysisOptions
- SpeakerLabelingOptions
- AudioChunk
- SilenceChunk
- Eou
- StreamingResponse
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
Expects audio in real time.
gRPC request
rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
StreamingRequest
{
// Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`
"sessionOptions": {
"recognitionModel": {
"model": "string",
"audioFormat": {
// Includes only one of the fields `rawAudio`, `containerAudio`
"rawAudio": {
"audioEncoding": "AudioEncoding",
"sampleRateHertz": "int64",
"audioChannelCount": "int64"
},
"containerAudio": {
"containerAudioType": "ContainerAudioType"
}
// end of the list of possible fields
},
"textNormalization": {
"textNormalization": "TextNormalization",
"profanityFilter": "bool",
"literatureText": "bool",
"phoneFormattingMode": "PhoneFormattingMode"
},
"languageRestriction": {
"restrictionType": "LanguageRestrictionType",
"languageCode": [
"string"
]
},
"audioProcessingType": "AudioProcessingType"
},
"eouClassifier": {
// Includes only one of the fields `defaultClassifier`, `externalClassifier`
"defaultClassifier": {
"type": "EouSensitivity",
"maxPauseBetweenWordsHintMs": "int64"
},
"externalClassifier": "ExternalEouClassifier"
// end of the list of possible fields
},
"recognitionClassifier": {
"classifiers": [
{
"classifier": "string",
"triggers": [
"TriggerType"
]
}
]
},
"speechAnalysis": {
"enableSpeakerAnalysis": "bool",
"enableConversationAnalysis": "bool",
"descriptiveStatisticsQuantiles": [
"double"
]
},
"speakerLabeling": {
"speakerLabeling": "SpeakerLabeling"
}
},
"chunk": {
"data": "bytes"
},
"silenceChunk": {
"durationMs": "int64"
},
"eou": "Eou"
// end of the list of possible fields
}
Streaming audio request.
Events are control messages from the user. The first message must contain the session options; the following messages are audio data chunks or control messages.
Field | Description
--- | ---
sessionOptions | Session options. Should be the first message from the user. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
chunk | Chunk with audio data. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
silenceChunk | Chunk with silence. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
eou | Request to end the current utterance. Works only with the external EOU detector. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
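For orientation, here is a minimal sketch of the client side of this call in Python, assuming the gRPC stubs generated from the stt/v3 protos are importable as `stt_pb2` and `stt_service_pb2_grpc` and that an API key is used for authorization. The module paths, endpoint, chunk size, and the `LINEAR16_PCM` enum value are typical-setup assumptions, not part of this reference.

```python
import grpc

# Assumed import paths for the generated stt/v3 stubs.
from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc

CHUNK_SIZE = 4000  # bytes of audio per request message (illustrative)


def request_iterator(audio_path):
    # The first message must carry the session options.
    yield stt_pb2.StreamingRequest(
        session_options=stt_pb2.StreamingOptions(
            recognition_model=stt_pb2.RecognitionModelOptions(
                model="general",
                audio_format=stt_pb2.AudioFormatOptions(
                    raw_audio=stt_pb2.RawAudio(
                        audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,  # assumed enum value
                        sample_rate_hertz=8000,
                        audio_channel_count=1,
                    )
                ),
            )
        )
    )
    # Every following message is an audio chunk (or another control message).
    with open(audio_path, "rb") as audio:
        while data := audio.read(CHUNK_SIZE):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))


def recognize_streaming(audio_path, api_key):
    channel = grpc.secure_channel(
        "stt.api.cloud.yandex.net:443", grpc.ssl_channel_credentials()
    )
    stub = stt_service_pb2_grpc.RecognizerStub(channel)
    # Bidirectional streaming: pass a request iterator, receive a response iterator.
    return stub.RecognizeStreaming(
        request_iterator(audio_path),
        metadata=(("authorization", f"Api-Key {api_key}"),),
    )
```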
StreamingOptions
Field | Description
--- | ---
recognitionModel | Configuration for speech recognition model.
eouClassifier | Configuration for end of utterance detection model.
recognitionClassifier | Configuration for classifiers over speech recognition.
speechAnalysis | Configuration for speech analysis over speech recognition.
speakerLabeling | Configuration for speaker labeling.
RecognitionModelOptions
Field | Description
--- | ---
model | string Sets the recognition model for the cloud version of SpeechKit. Possible values: 'general', 'general:rc', 'general:deprecated'.
audioFormat | Specifies the format of the input audio.
textNormalization | Text normalization options.
languageRestriction | Possible languages in the audio.
audioProcessingType | enum AudioProcessingType How to process the audio data (in real time, after all data is received, etc.). Default is REAL_TIME.
AudioFormatOptions
Audio format options.
Field | Description
--- | ---
rawAudio | Audio without a container. Includes only one of the fields `rawAudio`, `containerAudio`.
containerAudio | Audio wrapped in a container. Includes only one of the fields `rawAudio`, `containerAudio`.
RawAudio
RAW Audio format spec (no container to infer type). Used in AudioFormat options.
Field | Description
--- | ---
audioEncoding | enum AudioEncoding Type of audio encoding.
sampleRateHertz | int64 PCM sample rate.
audioChannelCount | int64 PCM channel count. Currently only single-channel audio is supported in real-time recognition.
ContainerAudio
Audio with fixed type in container. Used in AudioFormat options.
Field | Description
--- | ---
containerAudioType | enum ContainerAudioType Type of audio container.
TextNormalizationOptions
Text normalization options.
Field | Description
--- | ---
textNormalization | enum TextNormalization Text normalization mode.
profanityFilter | bool Profanity filter (default: false).
literatureText | bool Rewrite text in literature style (default: false).
phoneFormattingMode | enum PhoneFormattingMode Defines the phone formatting mode.
LanguageRestrictionOptions
Type of restriction for the list of languages expected in the incoming speech stream.
Field | Description
--- | ---
restrictionType | enum LanguageRestrictionType Language restriction type.
languageCode[] | string The list of language codes to restrict recognition in the case of an auto model.
EouClassifierOptions
Field | Description
--- | ---
defaultClassifier | EOU classifier provided by SpeechKit (default). Includes only one of the fields `defaultClassifier`, `externalClassifier` (the type of EOU classifier).
externalClassifier | EOU is enforced by external messages from the user. Includes only one of the fields `defaultClassifier`, `externalClassifier` (the type of EOU classifier).
DefaultEouClassifier
Field | Description
--- | ---
type | enum EouSensitivity EOU sensitivity. Currently two levels: faster with more errors, and more conservative (the default).
maxPauseBetweenWordsHintMs | int64 Hint for the maximum pause between words. The EOU detector can use this information to distinguish between end of utterance and slow speech.
ExternalEouClassifier
EOU is provided by the user.
This message has no fields.
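To make the two modes concrete, the options might be built like this. It is a sketch reusing the `stt_pb2` stubs from the request example; the `HIGH` sensitivity value and the pause hint are illustrative assumptions.

```python
# Built-in classifier: the server decides where utterances end.
eou_default = stt_pb2.EouClassifierOptions(
    default_classifier=stt_pb2.DefaultEouClassifier(
        type=stt_pb2.DefaultEouClassifier.HIGH,   # assumed EouSensitivity value
        max_pause_between_words_hint_ms=1500,     # illustrative pause hint
    )
)

# External classifier: the client decides and sends explicit Eou messages.
eou_external = stt_pb2.EouClassifierOptions(
    external_classifier=stt_pb2.ExternalEouClassifier()
)
```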
RecognitionClassifierOptions
Field | Description
--- | ---
classifiers[] | List of classifiers to use.
RecognitionClassifier
Field | Description
--- | ---
classifier | string Classifier name.
triggers[] | enum TriggerType The types of responses that will carry the classification results.
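As an illustration, attaching one classifier and choosing which response types carry its results could look like the sketch below; the classifier name and the `ON_FINAL` trigger value are assumptions, and the `stt_pb2` stubs are reused from the request example.

```python
classifier_options = stt_pb2.RecognitionClassifierOptions(
    classifiers=[
        stt_pb2.RecognitionClassifier(
            classifier="formal_greeting",                       # assumed classifier name
            triggers=[stt_pb2.RecognitionClassifier.ON_FINAL],  # assumed TriggerType value
        )
    ]
)
```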
SpeechAnalysisOptions
Field | Description
--- | ---
enableSpeakerAnalysis | bool Analyse speech for every speaker.
enableConversationAnalysis | bool Analyse the conversation of two speakers.
descriptiveStatisticsQuantiles[] | double Quantile levels in the range (0, 1) for descriptive statistics.
SpeakerLabelingOptions
Field | Description
--- | ---
speakerLabeling | enum SpeakerLabeling Specifies the execution of speaker labeling. Default is SPEAKER_LABELING_DISABLED.
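A short sketch of enabling both analyses and speaker labeling follows; the quantile levels are illustrative, and the `SPEAKER_LABELING_ENABLED` value is an assumption mirroring the documented `SPEAKER_LABELING_DISABLED` default.

```python
speech_analysis = stt_pb2.SpeechAnalysisOptions(
    enable_speaker_analysis=True,
    enable_conversation_analysis=True,
    descriptive_statistics_quantiles=[0.5, 0.9, 0.99],  # levels must lie in (0, 1)
)

speaker_labeling = stt_pb2.SpeakerLabelingOptions(
    speaker_labeling=stt_pb2.SpeakerLabelingOptions.SPEAKER_LABELING_ENABLED,  # assumed value
)
```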
AudioChunk
Data chunk with audio.
Field | Description
--- | ---
data | bytes Bytes with audio data.
SilenceChunk
Data chunk with silence.
Field | Description
--- | ---
durationMs | int64 Duration of the silence chunk in ms.
Eou
Forces EOU.
This message has no fields.
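When the external EOU classifier is selected, the client interleaves silence reports and explicit end-of-utterance requests with its audio chunks. A sketch of the two control messages, reusing the `stt_pb2` stubs from the request example (the silence duration is illustrative):

```python
# Report 500 ms of silence instead of sending empty audio.
silence_message = stt_pb2.StreamingRequest(
    silence_chunk=stt_pb2.SilenceChunk(duration_ms=500)
)

# Force the end of the current utterance; honored only with the external EOU classifier.
force_eou_message = stt_pb2.StreamingRequest(eou=stt_pb2.Eou())
```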
StreamingResponse
{
"sessionUuid": {
"uuid": "string",
"userRequestId": "string"
},
"audioCursors": {
"receivedDataMs": "int64",
"resetTimeMs": "int64",
"partialTimeMs": "int64",
"finalTimeMs": "int64",
"finalIndex": "int64",
"eouTimeMs": "int64"
},
"responseWallTimeMs": "int64",
// Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64",
"confidence": "double",
"languages": [
{
"languageCode": "string",
"probability": "double"
}
]
}
],
"channelTag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64",
"confidence": "double",
"languages": [
{
"languageCode": "string",
"probability": "double"
}
]
}
],
"channelTag": "string"
},
"eouUpdate": {
"timeMs": "int64"
},
"finalRefinement": {
"finalIndex": "int64",
// Includes only one of the fields `normalizedText`
"normalizedText": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64",
"confidence": "double",
"languages": [
{
"languageCode": "string",
"probability": "double"
}
]
}
],
"channelTag": "string"
}
// end of the list of possible fields
},
"statusCode": {
"codeType": "CodeType",
"message": "string"
},
"classifierUpdate": {
"windowType": "WindowType",
"startTimeMs": "int64",
"endTimeMs": "int64",
"classifierResult": {
"classifier": "string",
"highlights": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"labels": [
{
"label": "string",
"confidence": "double"
}
]
}
},
"speakerAnalysis": {
"speakerTag": "string",
"windowType": "WindowType",
"speechBoundaries": {
"startTimeMs": "int64",
"endTimeMs": "int64"
},
"totalSpeechMs": "int64",
"speechRatio": "double",
"totalSilenceMs": "int64",
"silenceRatio": "double",
"wordsCount": "int64",
"lettersCount": "int64",
"wordsPerSecond": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"lettersPerSecond": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"wordsPerUtterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"lettersPerUtterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"utteranceCount": "int64",
"utteranceDurationEstimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
}
},
"conversationAnalysis": {
"conversationBoundaries": {
"startTimeMs": "int64",
"endTimeMs": "int64"
},
"totalSimultaneousSilenceDurationMs": "int64",
"totalSimultaneousSilenceRatio": "double",
"simultaneousSilenceDurationEstimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"totalSimultaneousSpeechDurationMs": "int64",
"totalSimultaneousSpeechRatio": "double",
"simultaneousSpeechDurationEstimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"speakerInterrupts": [
{
"speakerTag": "string",
"interruptsCount": "int64",
"interruptsDurationMs": "int64",
"interrupts": [
{
"startTimeMs": "int64",
"endTimeMs": "int64"
}
]
}
],
"totalSpeechDurationMs": "int64",
"totalSpeechRatio": "double"
},
// end of the list of possible fields
"channelTag": "string"
}
Responses from the server.
Each response contains the session UUID, the audio cursors, and one specific event.
Field | Description
--- | ---
sessionUuid | Session identifier.
audioCursors | Progress of the stream recognition session: how much data has been received, final and partial times, etc.
responseWallTimeMs | int64 Wall clock on the server side. This is the time when the server wrote the results to the stream.
partial | Partial results; the server sends them regularly after enough audio data has been received from the user. These are the current text estimations. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
final | Final results; the recognition is now fixed up to final_time_ms. For now, a final is sent only if the EOU event was triggered. This could change in future releases. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
eouUpdate | After the EOU classifier triggers, the server sends the message with the final and then an EouUpdate with the time of the EOU. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
finalRefinement | For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
statusCode | Status messages, sent by the server at a fixed interval (keep-alive). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
classifierUpdate | Result of the triggered classifier. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
speakerAnalysis | Speech statistics for every speaker. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
conversationAnalysis | Conversation statistics. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
channelTag | string Tag to distinguish audio channels.
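A sketch of consuming the response stream and dispatching on the event oneof; the oneof name `Event` and the snake_case field names follow the usual protobuf conventions for the schema above and should be treated as assumptions.

```python
def handle_responses(responses):
    # `responses` is the iterator returned by stub.RecognizeStreaming(...).
    for resp in responses:
        event = resp.WhichOneof("Event")  # which event field is set in this response
        if event == "partial" and resp.partial.alternatives:
            print("partial:", resp.partial.alternatives[0].text)
        elif event == "final":
            print("final:", [a.text for a in resp.final.alternatives])
        elif event == "final_refinement":
            refined = resp.final_refinement.normalized_text.alternatives
            print("refinement of final", resp.final_refinement.final_index,
                  ":", refined[0].text if refined else "")
        elif event == "eou_update":
            print("EOU at", resp.eou_update.time_ms, "ms")
        elif event == "status_code":
            print("status:", resp.status_code.message)
```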
SessionUuid
Session identifier.
Field | Description
--- | ---
uuid | string Internal session identifier.
userRequestId | string User session identifier.
AudioCursors
AudioCursors represent the state of the ASR recognition stream.
Field | Description
--- | ---
receivedDataMs | int64 Amount of audio data the server has received, in ms. This cursor is moved after each audio chunk is received by the server.
resetTimeMs | int64 Input stream reset data.
partialTimeMs | int64 How much audio has been processed, including trimmed silences. This cursor is moved after the server has received enough data.
finalTimeMs | int64 Time of the last final. This cursor is moved when the server decides that the recognition from the start of the audio up to final_time_ms will not change anymore.
finalIndex | int64 Index of the last final the server has sent. Incremented after each new final.
eouTimeMs | int64 Estimated time of EOU. This cursor is updated after each new EOU is sent.
AlternativeUpdate
Update of hypothesis.
Field | Description
--- | ---
alternatives[] | List of hypotheses for timeframes.
channelTag | string
Alternative
Recognition of specific time frame.
Field | Description
--- | ---
words[] | Words in the time frame.
text | string Text in the time frame.
startTimeMs | int64 Start of the time frame.
endTimeMs | int64 End of the time frame.
confidence | double The hypothesis confidence. Currently not used.
languages[] | Distribution over possible languages.
Word
Recognized word.
Field | Description
--- | ---
text | string Word text.
startTimeMs | int64 Estimation of word start time in ms.
endTimeMs | int64 Estimation of word end time in ms.
LanguageEstimation
Estimation of language and its probability.
Field | Description
--- | ---
languageCode | string Language code in ISO 639-1 format.
probability | double Estimation of language probability.
EouUpdate
Update information for external End of Utterance.
Field | Description
--- | ---
timeMs | int64 EOU estimated time.
FinalRefinement
Refinement for the final hypothesis. For example, text normalization is a refinement.
Field | Description
--- | ---
finalIndex | int64 Index of the final for which the server sends additional information.
normalizedText | Normalized text instead of the raw one. Includes only one of the fields `normalizedText` (the type of refinement).
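Since a refinement references its final by index, one way to assemble a normalized transcript is to key finals and refinements by `final_index`. This is a sketch under the assumption that the `final_index` cursor in a response carrying a final identifies that final.

```python
def collect_transcript(responses):
    finals, refinements = {}, {}
    for resp in responses:
        event = resp.WhichOneof("Event")
        if event == "final":
            finals[resp.audio_cursors.final_index] = resp.final
        elif event == "final_refinement":
            refinements[resp.final_refinement.final_index] = resp.final_refinement.normalized_text
    # Prefer the normalized text when a refinement has arrived for a final.
    transcript = []
    for index in sorted(finals):
        update = refinements.get(index, finals[index])
        if update.alternatives:
            transcript.append(update.alternatives[0].text)
    return transcript
```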
StatusCode
Status message
Field | Description
--- | ---
codeType | enum CodeType Code type.
message | string Human readable message.
RecognitionClassifierUpdate
Field | Description
--- | ---
windowType | enum WindowType Response window type.
startTimeMs | int64 Start time of the audio segment used for classification.
endTimeMs | int64 End time of the audio segment used for classification.
classifierResult | Result for the dictionary-based classifier.
RecognitionClassifierResult
Field | Description
--- | ---
classifier | string Name of the triggered classifier.
highlights[] | List of highlights, i.e. parts of the phrase that determine the result of the classification.
labels[] | Classifier predictions.
PhraseHighlight
Field | Description
--- | ---
text | string Text transcription of the highlighted audio segment.
startTimeMs | int64 Start time of the highlighted audio segment.
endTimeMs | int64 End time of the highlighted audio segment.
RecognitionClassifierLabel
Field | Description
--- | ---
label | string The label of the class predicted by the classifier.
confidence | double The prediction confidence.
SpeakerAnalysis
Field | Description
--- | ---
speakerTag | string Speaker tag.
windowType | enum WindowType Response window type.
speechBoundaries | Audio segment boundaries.
totalSpeechMs | int64 Total speech duration.
speechRatio | double Speech ratio within the audio segment.
totalSilenceMs | int64 Total silence duration.
silenceRatio | double Silence ratio within the audio segment.
wordsCount | int64 Number of words in recognized speech.
lettersCount | int64 Number of letters in recognized speech.
wordsPerSecond | Descriptive statistics for the words-per-second distribution.
lettersPerSecond | Descriptive statistics for the letters-per-second distribution.
wordsPerUtterance | Descriptive statistics for the words-per-utterance distribution.
lettersPerUtterance | Descriptive statistics for the letters-per-utterance distribution.
utteranceCount | int64 Number of utterances.
utteranceDurationEstimation | Descriptive statistics for the utterance duration distribution.
AudioSegmentBoundaries
Field | Description
--- | ---
startTimeMs | int64 Audio segment start time.
endTimeMs | int64 Audio segment end time.
DescriptiveStatistics
Field | Description
--- | ---
min | double Minimum observed value.
max | double Maximum observed value.
mean | double Estimated mean of the distribution.
std | double Estimated standard deviation of the distribution.
quantiles[] | List of evaluated quantiles.
Quantile
Field | Description
--- | ---
level | double Quantile level in the range (0, 1).
value | double Quantile value.
ConversationAnalysis
Field | Description
--- | ---
conversationBoundaries | Audio segment boundaries.
totalSimultaneousSilenceDurationMs | int64 Total simultaneous silence duration.
totalSimultaneousSilenceRatio | double Simultaneous silence ratio within the audio segment.
simultaneousSilenceDurationEstimation | Descriptive statistics for the simultaneous silence duration distribution.
totalSimultaneousSpeechDurationMs | int64 Total simultaneous speech duration.
totalSimultaneousSpeechRatio | double Simultaneous speech ratio within the audio segment.
simultaneousSpeechDurationEstimation | Descriptive statistics for the simultaneous speech duration distribution.
speakerInterrupts[] | Interrupts description for every speaker.
totalSpeechDurationMs | int64 Total speech duration, including both simultaneous and separate speech.
totalSpeechRatio | double Total speech ratio within the audio segment.
InterruptsEvaluation
Field | Description
--- | ---
speakerTag | string Speaker tag.
interruptsCount | int64 Number of interrupts made by the speaker.
interruptsDurationMs | int64 Total duration of all interrupts.
interrupts[] | Boundaries for every interrupt.