SpeechKit Recognition API v3, gRPC: Recognizer.RecognizeStreaming
- gRPC request
- StreamingRequest
  - StreamingOptions
    - RecognitionModelOptions
      - AudioFormatOptions
        - RawAudio
        - ContainerAudio
      - TextNormalizationOptions
      - LanguageRestrictionOptions
    - EouClassifierOptions
      - DefaultEouClassifier
      - ExternalEouClassifier
    - RecognitionClassifierOptions
      - RecognitionClassifier
    - SpeechAnalysisOptions
    - SpeakerLabelingOptions
    - SummarizationOptions
      - SummarizationProperty
        - JsonSchema
  - AudioChunk
  - SilenceChunk
  - Eou
- StreamingResponse
  - SessionUuid
  - AudioCursors
  - AlternativeUpdate
    - Alternative
      - Word
      - LanguageEstimation
  - EouUpdate
  - FinalRefinement
  - StatusCode
  - RecognitionClassifierUpdate
    - RecognitionClassifierResult
      - PhraseHighlight
      - RecognitionClassifierLabel
  - SpeakerAnalysis
    - AudioSegmentBoundaries
    - DescriptiveStatistics
      - Quantile
  - ConversationAnalysis
    - InterruptsEvaluation
  - Summarization
    - SummarizationPropertyResult
    - ContentUsage
Performs bidirectional streaming speech recognition: receives recognition results while audio is being sent.
gRPC request
rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
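
A minimal Python sketch of opening the stream is shown below. It assumes the generated v3 stubs from the public `yandexcloud` package, the `stt.api.cloud.yandex.net:443` endpoint, and API-key authorization; treat all three as common SpeechKit conventions rather than something this reference guarantees.

```python
import grpc

import yandex.cloud.ai.stt.v3.stt_pb2 as stt_pb2
import yandex.cloud.ai.stt.v3.stt_service_pb2_grpc as stt_service_pb2_grpc

API_KEY = "<api-key>"  # placeholder: a service account API key


def recognize_streaming(requests):
    """Open the bidirectional stream and return the response iterator."""
    channel = grpc.secure_channel(
        "stt.api.cloud.yandex.net:443", grpc.ssl_channel_credentials()
    )
    stub = stt_service_pb2_grpc.RecognizerStub(channel)
    # `requests` is an iterator of StreamingRequest messages; the call
    # returns an iterator of StreamingResponse messages.
    return stub.RecognizeStreaming(
        requests, metadata=(("authorization", f"Api-Key {API_KEY}"),)
    )
```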
StreamingRequest
{
// Includes only one of the fields `session_options`, `chunk`, `silence_chunk`, `eou`
"session_options": {
"recognition_model": {
"model": "string",
"audio_format": {
// Includes only one of the fields `raw_audio`, `container_audio`
"raw_audio": {
"audio_encoding": "AudioEncoding",
"sample_rate_hertz": "int64",
"audio_channel_count": "int64"
},
"container_audio": {
"container_audio_type": "ContainerAudioType"
}
// end of the list of possible fields
},
"text_normalization": {
"text_normalization": "TextNormalization",
"profanity_filter": "bool",
"literature_text": "bool",
"phone_formatting_mode": "PhoneFormattingMode"
},
"language_restriction": {
"restriction_type": "LanguageRestrictionType",
"language_code": [
"string"
]
},
"audio_processing_type": "AudioProcessingType"
},
"eou_classifier": {
// Includes only one of the fields `default_classifier`, `external_classifier`
"default_classifier": {
"type": "EouSensitivity",
"max_pause_between_words_hint_ms": "int64"
},
"external_classifier": "ExternalEouClassifier"
// end of the list of possible fields
},
"recognition_classifier": {
"classifiers": [
{
"classifier": "string",
"triggers": [
"TriggerType"
]
}
]
},
"speech_analysis": {
"enable_speaker_analysis": "bool",
"enable_conversation_analysis": "bool",
"descriptive_statistics_quantiles": [
"double"
]
},
"speaker_labeling": {
"speaker_labeling": "SpeakerLabeling"
},
"summarization": {
"model_uri": "string",
"properties": [
{
"instruction": "string",
// Includes only one of the fields `json_object`, `json_schema`
"json_object": "bool",
"json_schema": {
"schema": "google.protobuf.Struct"
}
// end of the list of possible fields
}
]
}
},
"chunk": {
"data": "bytes"
},
"silence_chunk": {
"duration_ms": "int64"
},
"eou": "Eou"
// end of the list of possible fields
}
Streaming audio request.
Events are control messages from the user. The first message must carry the session options; subsequent messages carry audio data chunks or control messages, as sketched below.
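
A sketch of that ordering, reusing the `stt_pb2` module from the snippet above: the generator yields the session options first, then reads audio from a file in chunks (the 4000-byte chunk size is an arbitrary illustrative choice).

```python
def request_generator(streaming_options, audio_path, chunk_size=4000):
    # First message: session options only.
    yield stt_pb2.StreamingRequest(session_options=streaming_options)
    # Subsequent messages: audio data chunks.
    with open(audio_path, "rb") as audio:
        while data := audio.read(chunk_size):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))
```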
| Field | Description |
|---|---|
| session_options | `StreamingOptions`. Session options. Must be the first message from the user. Includes only one of the fields `session_options`, `chunk`, `silence_chunk`, `eou`. |
| chunk | `AudioChunk`. Chunk with audio data. Includes only one of the fields `session_options`, `chunk`, `silence_chunk`, `eou`. |
| silence_chunk | `SilenceChunk`. Chunk with silence. Includes only one of the fields `session_options`, `chunk`, `silence_chunk`, `eou`. |
| eou | `Eou`. Request to end the current utterance. Works only with the external EOU detector. Includes only one of the fields `session_options`, `chunk`, `silence_chunk`, `eou`. |
StreamingOptions

| Field | Description |
|---|---|
| recognition_model | `RecognitionModelOptions`. Configuration for the speech recognition model. |
| eou_classifier | `EouClassifierOptions`. Configuration for the end-of-utterance detection model. |
| recognition_classifier | `RecognitionClassifierOptions`. Configuration for classifiers applied on top of speech recognition. |
| speech_analysis | `SpeechAnalysisOptions`. Configuration for speech analysis applied on top of speech recognition. |
| speaker_labeling | `SpeakerLabelingOptions`. Configuration for speaker labeling. |
| summarization | `SummarizationOptions`. Summarization options. |
RecognitionModelOptions

| Field | Description |
|---|---|
| model | `string`. Sets the recognition model for the cloud version of SpeechKit. |
| audio_format | `AudioFormatOptions`. Format of the input audio. |
| text_normalization | `TextNormalizationOptions`. Text normalization options. |
| language_restriction | `LanguageRestrictionOptions`. Possible languages in the audio. |
| audio_processing_type | `enum AudioProcessingType`. Audio processing type. |
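
A sketch of a typical configuration for raw 16-bit PCM at 8 kHz with normalization enabled and recognition restricted to a single language. Enum value names such as `LINEAR16_PCM`, `WHITELIST`, and `REAL_TIME` come from the SpeechKit v3 proto and should be treated as assumptions here; the concrete settings are examples only.

```python
streaming_options = stt_pb2.StreamingOptions(
    recognition_model=stt_pb2.RecognitionModelOptions(
        model="general",  # assumed model name
        audio_format=stt_pb2.AudioFormatOptions(
            raw_audio=stt_pb2.RawAudio(
                audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                sample_rate_hertz=8000,
                audio_channel_count=1,
            )
        ),
        text_normalization=stt_pb2.TextNormalizationOptions(
            text_normalization=stt_pb2.TextNormalizationOptions.TEXT_NORMALIZATION_ENABLED,
            profanity_filter=True,
            literature_text=False,
        ),
        language_restriction=stt_pb2.LanguageRestrictionOptions(
            restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
            language_code=["ru-RU"],
        ),
        audio_processing_type=stt_pb2.RecognitionModelOptions.REAL_TIME,
    )
)
```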
AudioFormatOptions

Audio format options.

| Field | Description |
|---|---|
| raw_audio | `RawAudio`. RAW audio without a container. Includes only one of the fields `raw_audio`, `container_audio`. |
| container_audio | `ContainerAudio`. Audio wrapped in a container. Includes only one of the fields `raw_audio`, `container_audio`. |
RawAudio

RAW audio format spec (no container to infer the type from). Used in AudioFormat options.

| Field | Description |
|---|---|
| audio_encoding | `enum AudioEncoding`. Type of audio encoding. |
| sample_rate_hertz | `int64`. PCM sample rate. |
| audio_channel_count | `int64`. PCM channel count. Currently only single-channel audio is supported in real-time recognition. |
ContainerAudio

Audio with a fixed type in a container. Used in AudioFormat options.

| Field | Description |
|---|---|
| container_audio_type | `enum ContainerAudioType`. Type of audio container. |
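
When the audio is wrapped in a container rather than raw PCM, the `container_audio` branch of the oneof is filled instead; `OGG_OPUS` is an assumed member of the `ContainerAudioType` enum.

```python
audio_format = stt_pb2.AudioFormatOptions(
    container_audio=stt_pb2.ContainerAudio(
        container_audio_type=stt_pb2.ContainerAudio.OGG_OPUS,
    )
)
```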
TextNormalizationOptions

Options for post-processing text results. The normalization levels depend on the settings and the language. For detailed information, see the documentation.

| Field | Description |
|---|---|
| text_normalization | `enum TextNormalization`. Text normalization mode. |
| profanity_filter | `bool`. Profanity filter (default: false). |
| literature_text | `bool`. Rewrite text in literature style (default: false). |
| phone_formatting_mode | `enum PhoneFormattingMode`. Defines the phone formatting mode. |
LanguageRestrictionOptions

Type of restriction for the list of languages expected in the incoming audio.

| Field | Description |
|---|---|
| restriction_type | `enum LanguageRestrictionType`. Language restriction type. |
| language_code[] | `string`. The list of language codes to restrict recognition to when using an auto model. |
EouClassifierOptions

| Field | Description |
|---|---|
| default_classifier | `DefaultEouClassifier`. Default EOU classifier provided by SpeechKit. Includes only one of the fields `default_classifier`, `external_classifier`. |
| external_classifier | `ExternalEouClassifier`. EOU classifier driven by external messages from the user. Includes only one of the fields `default_classifier`, `external_classifier`. |
DefaultEouClassifier

| Field | Description |
|---|---|
| type | `enum EouSensitivity`. EOU sensitivity. |
| max_pause_between_words_hint_ms | `int64`. Hint for the maximum pause between words. The SpeechKit EOU detector can use this information to adjust the speed of EOU detection. |
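
A sketch of selecting the built-in classifier; `HIGH` is an assumed member of the `EouSensitivity` enum, and the pause hint is an illustrative value in milliseconds.

```python
eou_classifier = stt_pb2.EouClassifierOptions(
    default_classifier=stt_pb2.DefaultEouClassifier(
        type=stt_pb2.DefaultEouClassifier.HIGH,  # assumed sensitivity value
        max_pause_between_words_hint_ms=1000,
    )
)
```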
ExternalEouClassifier

Use an EOU detector provided by the user.

| Field | Description |
|---|---|
| Empty |  |
RecognitionClassifierOptions

| Field | Description |
|---|---|
| classifiers[] | `RecognitionClassifier`. List of classifiers to use. For detailed information and a usage example, see the documentation. |
RecognitionClassifier

| Field | Description |
|---|---|
| classifier | `string`. Classifier name. |
| triggers[] | `enum TriggerType`. Types of responses that trigger classification; classification results follow responses of the specified types. |
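
A sketch of attaching a classifier to the session. The classifier name is hypothetical, and `ON_FINAL` is assumed to be a file-level `TriggerType` value exported by the generated module; adjust the path if the enum is nested differently in your stubs.

```python
recognition_classifier = stt_pb2.RecognitionClassifierOptions(
    classifiers=[
        stt_pb2.RecognitionClassifier(
            classifier="formal_greeting",  # hypothetical classifier name
            triggers=[stt_pb2.ON_FINAL],   # fire after each final response
        )
    ]
)
```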
SpeechAnalysisOptions

| Field | Description |
|---|---|
| enable_speaker_analysis | `bool`. Analyze speech for every speaker. |
| enable_conversation_analysis | `bool`. Analyze the conversation of two speakers. |
| descriptive_statistics_quantiles[] | `double`. Quantile levels in the range (0, 1) for descriptive statistics. |
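
A sketch of enabling both kinds of analysis with three quantile levels for the descriptive statistics; the levels are illustrative.

```python
speech_analysis = stt_pb2.SpeechAnalysisOptions(
    enable_speaker_analysis=True,
    enable_conversation_analysis=True,
    descriptive_statistics_quantiles=[0.5, 0.9, 0.99],
)
```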
SpeakerLabelingOptions

| Field | Description |
|---|---|
| speaker_labeling | `enum SpeakerLabeling`. Specifies whether speaker labeling is performed. |
SummarizationOptions

Represents transcription summarization options.

| Field | Description |
|---|---|
| model_uri | `string`. The ID of the model to be used for completion generation. |
| properties[] | `SummarizationProperty`. A list of summarizations to perform on the transcription. |
SummarizationProperty

Represents a summarization entry for the transcription.

| Field | Description |
|---|---|
| instruction | `string`. Summarization instruction for the model. |
| json_object | `bool`. When set to true, the model will return a valid JSON object. Specifies the format of the model's response. Includes only one of the fields `json_object`, `json_schema`. |
| json_schema | `JsonSchema`. Enforces a specific JSON structure for the model's response based on a provided schema. Specifies the format of the model's response. Includes only one of the fields `json_object`, `json_schema`. |
JsonSchema

Represents the expected structure of the model's response using a JSON Schema.

| Field | Description |
|---|---|
| schema | `google.protobuf.Struct`. The JSON Schema that the model's output must conform to. |
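
A sketch of requesting a structured summary. The `schema` field is a `google.protobuf.Struct` built from a plain dict; the model URI format and the schema content are assumptions for illustration.

```python
from google.protobuf.struct_pb2 import Struct

schema = Struct()
schema.update({
    "type": "object",
    "properties": {"topic": {"type": "string"}},
})

summarization = stt_pb2.SummarizationOptions(
    model_uri="gpt://<folder_id>/yandexgpt/latest",  # assumed URI format
    properties=[
        stt_pb2.SummarizationProperty(
            instruction="Name the main topic of the call.",
            json_schema=stt_pb2.JsonSchema(schema=schema),
        )
    ],
)
```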
AudioChunk

Data chunk with audio.

| Field | Description |
|---|---|
| data | `bytes`. Bytes with audio data. |
SilenceChunk

Data chunk with silence.

| Field | Description |
|---|---|
| duration_ms | `int64`. Duration of the silence chunk, in ms. |
Eou

Force EOU.

| Field | Description |
|---|---|
| Empty |  |
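
A sketch of driving utterance boundaries from the client side: select the external classifier in the session options, then, after streaming the audio of an utterance, send a silence chunk followed by an explicit `Eou` message (the 500 ms silence is illustrative).

```python
# In the session options:
eou_options = stt_pb2.EouClassifierOptions(
    external_classifier=stt_pb2.ExternalEouClassifier()
)


# After the audio of an utterance has been sent:
def end_utterance_requests():
    yield stt_pb2.StreamingRequest(
        silence_chunk=stt_pb2.SilenceChunk(duration_ms=500)
    )
    yield stt_pb2.StreamingRequest(eou=stt_pb2.Eou())
```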
StreamingResponse
{
"session_uuid": {
"uuid": "string",
"user_request_id": "string"
},
"audio_cursors": {
"received_data_ms": "int64",
"reset_time_ms": "int64",
"partial_time_ms": "int64",
"final_time_ms": "int64",
"final_index": "int64",
"eou_time_ms": "int64"
},
"response_wall_time_ms": "int64",
// Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64",
"confidence": "double",
"languages": [
{
"language_code": "string",
"probability": "double"
}
]
}
],
"channel_tag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64",
"confidence": "double",
"languages": [
{
"language_code": "string",
"probability": "double"
}
]
}
],
"channel_tag": "string"
},
"eou_update": {
"time_ms": "int64"
},
"final_refinement": {
"final_index": "int64",
// Includes only one of the fields `normalized_text`
"normalized_text": {
"alternatives": [
{
"words": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64",
"confidence": "double",
"languages": [
{
"language_code": "string",
"probability": "double"
}
]
}
],
"channel_tag": "string"
}
// end of the list of possible fields
},
"status_code": {
"code_type": "CodeType",
"message": "string"
},
"classifier_update": {
"window_type": "WindowType",
"start_time_ms": "int64",
"end_time_ms": "int64",
"classifier_result": {
"classifier": "string",
"highlights": [
{
"text": "string",
"start_time_ms": "int64",
"end_time_ms": "int64"
}
],
"labels": [
{
"label": "string",
"confidence": "double"
}
]
}
},
"speaker_analysis": {
"speaker_tag": "string",
"window_type": "WindowType",
"speech_boundaries": {
"start_time_ms": "int64",
"end_time_ms": "int64"
},
"total_speech_ms": "int64",
"speech_ratio": "double",
"total_silence_ms": "int64",
"silence_ratio": "double",
"words_count": "int64",
"letters_count": "int64",
"words_per_second": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"letters_per_second": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"words_per_utterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"letters_per_utterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"utterance_count": "int64",
"utterance_duration_estimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
}
},
"conversation_analysis": {
"conversation_boundaries": {
"start_time_ms": "int64",
"end_time_ms": "int64"
},
"total_simultaneous_silence_duration_ms": "int64",
"total_simultaneous_silence_ratio": "double",
"simultaneous_silence_duration_estimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"total_simultaneous_speech_duration_ms": "int64",
"total_simultaneous_speech_ratio": "double",
"simultaneous_speech_duration_estimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"speaker_interrupts": [
{
"speaker_tag": "string",
"interrupts_count": "int64",
"interrupts_duration_ms": "int64",
"interrupts": [
{
"start_time_ms": "int64",
"end_time_ms": "int64"
}
]
}
],
"total_speech_duration_ms": "int64",
"total_speech_ratio": "double"
},
"summarization": {
"results": [
{
"response": "string"
}
],
"content_usage": {
"input_text_tokens": "int64",
"completion_tokens": "int64",
"total_tokens": "int64"
}
},
// end of the list of possible fields
"channel_tag": "string"
}
Responses from the server.
Each response contains the session UUID, audio cursors, and one specific event.

| Field | Description |
|---|---|
| session_uuid | `SessionUuid`. Session identifier. |
| audio_cursors | `AudioCursors`. Progress of the streaming recognition session: how much audio the server has received, partial and final times, and so on. |
| response_wall_time_ms | `int64`. Server-side wall-clock time at which the server wrote the results to the stream. |
| partial | `AlternativeUpdate`. Partial results; the server sends them regularly once enough audio data has been received from the user. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| final | `AlternativeUpdate`. Final results; the recognition is fixed up to `final_time_ms`. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| eou_update | `EouUpdate`. After the EOU classifier fires, the server sends a final message followed by an EouUpdate carrying the EOU time. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| final_refinement | `FinalRefinement`. For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| status_code | `StatusCode`. Status messages sent by the server at a fixed interval (keep-alive). Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| classifier_update | `RecognitionClassifierUpdate`. Result of the triggered classifier. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| speaker_analysis | `SpeakerAnalysis`. Speech statistics for every speaker. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| conversation_analysis | `ConversationAnalysis`. Conversation statistics. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| summarization | `Summarization`. Summary. Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`, `summarization`. |
| channel_tag | `string`. Tag to distinguish audio channels. |
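
A sketch of consuming the stream and dispatching on the event oneof, which is assumed to be named `Event` in the proto (as in published SpeechKit v3 examples).

```python
def handle_responses(responses):
    for r in responses:
        event = r.WhichOneof("Event")  # name of the field set in the oneof
        if event == "partial" and r.partial.alternatives:
            print("partial:", r.partial.alternatives[0].text)
        elif event == "final":
            print("final:", r.final.alternatives[0].text)
        elif event == "final_refinement":
            alts = r.final_refinement.normalized_text.alternatives
            print("normalized:", alts[0].text)
        elif event == "status_code":
            print("status:", r.status_code.message)
```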
SessionUuid

Session identifier.

| Field | Description |
|---|---|
| uuid | `string`. Internal session identifier. |
| user_request_id | `string`. User session identifier. |
AudioCursors

AudioCursors represent the state of the ASR recognition stream.

| Field | Description |
|---|---|
| received_data_ms | `int64`. Amount of audio data received by the server, in milliseconds. This cursor advances after each audio chunk the server receives. |
| reset_time_ms | `int64`. Input stream reset data. |
| partial_time_ms | `int64`. How much audio has been processed. This time includes trimmed silences. |
| final_time_ms | `int64`. Time of the last final. This cursor moves when the server decides that the recognition of the audio from the start up to this time will no longer change. |
| final_index | `int64`. Index of the last final the server sent. Incremented after each new final. |
| eou_time_ms | `int64`. Estimated time of the EOU. This cursor updates after each new EOU is sent. |
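
For example, the gap between `received_data_ms` and `partial_time_ms` gives a rough measure of recognition lag; a minimal sketch:

```python
def recognition_lag_ms(response):
    # How far processed (partial) time trails the audio already received.
    c = response.audio_cursors
    return c.received_data_ms - c.partial_time_ms
```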
AlternativeUpdate

Update of the hypothesis.

| Field | Description |
|---|---|
| alternatives[] | `Alternative`. List of hypotheses for the time frame. |
| channel_tag | `string`. |
Alternative

Recognition of a specific time frame.

| Field | Description |
|---|---|
| words[] | `Word`. Words in the time frame. |
| text | `string`. Text in the time frame. |
| start_time_ms | `int64`. Start of the time frame. |
| end_time_ms | `int64`. End of the time frame. |
| confidence | `double`. The hypothesis confidence. Currently not used. |
| languages[] | `LanguageEstimation`. Distribution over possible languages. |
Word

Recognized word.

| Field | Description |
|---|---|
| text | `string`. Word text. |
| start_time_ms | `int64`. Estimated word start time, in ms. |
| end_time_ms | `int64`. Estimated word end time, in ms. |
LanguageEstimation

Estimation of a language and its probability.

| Field | Description |
|---|---|
| language_code | `string`. Language tag in IETF BCP 47 format, consisting of an ISO 639-1 language code and an ISO 3166-1 country code (e.g., en-US, ru-RU). |
| probability | `double`. Estimated probability of the language. |
EouUpdate

Update information for an external end of utterance.

| Field | Description |
|---|---|
| time_ms | `int64`. Estimated EOU time. |
FinalRefinement

Refinement of a final hypothesis. For example, text normalization is a refinement.

| Field | Description |
|---|---|
| final_index | `int64`. Index of the final for which the server sends additional information. |
| normalized_text | `AlternativeUpdate`. Normalized text instead of the raw one. Type of refinement. Includes only one of the fields `normalized_text`. |
StatusCode

Status message.

| Field | Description |
|---|---|
| code_type | `enum CodeType`. Code type. |
| message | `string`. Human-readable message. |
RecognitionClassifierUpdate

| Field | Description |
|---|---|
| window_type | `enum WindowType`. Response window type. |
| start_time_ms | `int64`. Start time of the audio segment used for classification. |
| end_time_ms | `int64`. End time of the audio segment used for classification. |
| classifier_result | `RecognitionClassifierResult`. Result for the dictionary-based classifier. |
RecognitionClassifierResult

| Field | Description |
|---|---|
| classifier | `string`. Name of the triggered classifier. |
| highlights[] | `PhraseHighlight`. List of highlights, i.e., parts of the phrase that determine the result of the classification. |
| labels[] | `RecognitionClassifierLabel`. Classifier predictions. |
PhraseHighlight

| Field | Description |
|---|---|
| text | `string`. Text transcription of the highlighted audio segment. |
| start_time_ms | `int64`. Start time of the highlighted audio segment. |
| end_time_ms | `int64`. End time of the highlighted audio segment. |
RecognitionClassifierLabel

| Field | Description |
|---|---|
| label | `string`. The label of the class predicted by the classifier. |
| confidence | `double`. The prediction confidence. |
SpeakerAnalysis

| Field | Description |
|---|---|
| speaker_tag | `string`. Speaker tag. |
| window_type | `enum WindowType`. Response window type. |
| speech_boundaries | `AudioSegmentBoundaries`. Audio segment boundaries. |
| total_speech_ms | `int64`. Total speech duration. |
| speech_ratio | `double`. Speech ratio within the audio segment. |
| total_silence_ms | `int64`. Total silence duration. |
| silence_ratio | `double`. Silence ratio within the audio segment. |
| words_count | `int64`. Number of words in the recognized speech. |
| letters_count | `int64`. Number of letters in the recognized speech. |
| words_per_second | `DescriptiveStatistics`. Descriptive statistics for the words-per-second distribution. |
| letters_per_second | `DescriptiveStatistics`. Descriptive statistics for the letters-per-second distribution. |
| words_per_utterance | `DescriptiveStatistics`. Descriptive statistics for the words-per-utterance distribution. |
| letters_per_utterance | `DescriptiveStatistics`. Descriptive statistics for the letters-per-utterance distribution. |
| utterance_count | `int64`. Number of utterances. |
| utterance_duration_estimation | `DescriptiveStatistics`. Descriptive statistics for the utterance duration distribution. |
AudioSegmentBoundaries

| Field | Description |
|---|---|
| start_time_ms | `int64`. Audio segment start time. |
| end_time_ms | `int64`. Audio segment end time. |
DescriptiveStatistics

| Field | Description |
|---|---|
| min | `double`. Minimum observed value. |
| max | `double`. Maximum observed value. |
| mean | `double`. Estimated mean of the distribution. |
| std | `double`. Estimated standard deviation of the distribution. |
| quantiles[] | `Quantile`. List of evaluated quantiles. |
Quantile

| Field | Description |
|---|---|
| level | `double`. Quantile level in the range (0, 1). |
| value | `double`. Quantile value. |
ConversationAnalysis

| Field | Description |
|---|---|
| conversation_boundaries | `AudioSegmentBoundaries`. Audio segment boundaries. |
| total_simultaneous_silence_duration_ms | `int64`. Total simultaneous silence duration. |
| total_simultaneous_silence_ratio | `double`. Simultaneous silence ratio within the audio segment. |
| simultaneous_silence_duration_estimation | `DescriptiveStatistics`. Descriptive statistics for the simultaneous silence duration distribution. |
| total_simultaneous_speech_duration_ms | `int64`. Total simultaneous speech duration. |
| total_simultaneous_speech_ratio | `double`. Simultaneous speech ratio within the audio segment. |
| simultaneous_speech_duration_estimation | `DescriptiveStatistics`. Descriptive statistics for the simultaneous speech duration distribution. |
| speaker_interrupts[] | `InterruptsEvaluation`. Interrupts description for every speaker. |
| total_speech_duration_ms | `int64`. Total speech duration, including both simultaneous and separate speech. |
| total_speech_ratio | `double`. Total speech ratio within the audio segment. |
InterruptsEvaluation

| Field | Description |
|---|---|
| speaker_tag | `string`. Speaker tag. |
| interrupts_count | `int64`. Number of interrupts made by the speaker. |
| interrupts_duration_ms | `int64`. Total duration of all interrupts. |
| interrupts[] | `AudioSegmentBoundaries`. Boundaries for every interrupt. |
Summarization

| Field | Description |
|---|---|
| results[] | `SummarizationPropertyResult`. A list of summarizations of the transcription. |
| content_usage | `ContentUsage`. Statistics describing the number of content tokens used by the completion model. |
SummarizationPropertyResult

Represents a summarization response entry for the transcription.

| Field | Description |
|---|---|
| response | `string`. Summarization response text. |
ContentUsage

An object representing the number of content tokens used by the completion model.

| Field | Description |
|---|---|
| input_text_tokens | `int64`. The number of tokens in the textual part of the model input. |
| completion_tokens | `int64`. The number of tokens in the generated completion. |
| total_tokens | `int64`. The total number of tokens, including all input tokens and all generated tokens. |