SpeechKit Hybrid Recognition API, gRPC: Recognizer
Updated at January 25, 2024
- Calls Recognizer
- RecognizeStreaming
- StreamingRequest
- StreamingOptions
- RecognitionModelOptions
- AudioFormatOptions
- RawAudio
- ContainerAudio
- TextNormalizationOptions
- LanguageRestrictionOptions
- EouClassifierOptions
- DefaultEouClassifier
- ExternalEouClassifier
- RecognitionClassifierOptions
- RecognitionClassifier
- SpeechAnalysisOptions
- SpeakerLabelingOptions
- AudioChunk
- SilenceChunk
- Eou
- StreamingResponse
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
A set of methods for voice recognition.
Call | Description |
---|---|
RecognizeStreaming | Expects audio in real time |
Calls Recognizer
RecognizeStreaming
Expects audio in real time
rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
StreamingRequest
Field | Description |
---|---|
Event | oneof: session_options, chunk, silence_chunk or eou |
session_options | StreamingOptions Session options. Must be the first message from the user. |
chunk | AudioChunk Chunk with audio data. |
silence_chunk | SilenceChunk Chunk with silence. |
eou | Eou Request to end the current utterance. Only works with the external EOU classifier. |
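The ordering constraint above (session options first, then audio, then an optional explicit EOU) can be sketched in plain Python. This is a simplified model: the dicts stand in for the generated protobuf classes, and the field names mirror the oneof members in the table.

```python
# A protobuf-free sketch of the StreamingRequest message order:
# session options first, then audio chunks, then an explicit Eou
# (only meaningful with the external EOU classifier).

def streaming_requests(audio_chunks, send_eou=False):
    """Yield StreamingRequest-like dicts in the order the service expects."""
    # Session options must be the first message from the user.
    yield {"session_options": {
        "recognition_model": {"model": "general"},
    }}
    for chunk in audio_chunks:
        yield {"chunk": {"data": chunk}}
    if send_eou:
        # Force an end of utterance; requires the external classifier.
        yield {"eou": {}}

requests = list(streaming_requests([b"\x00\x01", b"\x02\x03"], send_eou=True))
```

With the generated stubs, the same generator shape (yielding real `StreamingRequest` objects) is what you would pass to the bidirectional `RecognizeStreaming` call.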
StreamingOptions
Field | Description |
---|---|
recognition_model | RecognitionModelOptions Configuration for speech recognition model. |
eou_classifier | EouClassifierOptions Configuration for end of utterance detection model. |
recognition_classifier | RecognitionClassifierOptions Configuration for classifiers over speech recognition. |
speech_analysis | SpeechAnalysisOptions Configuration for speech analysis over speech recognition. |
speaker_labeling | SpeakerLabelingOptions Configuration for speaker labeling. |
RecognitionModelOptions
Field | Description |
---|---|
model | string Sets the recognition model for the cloud version of SpeechKit. Possible values: 'general', 'general:rc', 'general:deprecated'. The model is ignored for SpeechKit Hybrid. |
audio_format | AudioFormatOptions Specifies the input audio format. |
text_normalization | TextNormalizationOptions Text normalization options. |
language_restriction | LanguageRestrictionOptions Possible languages in audio. |
audio_processing_type | enum AudioProcessingType How to process the audio data (in real time, after all the data is received, etc.). Default is REAL_TIME. |
AudioFormatOptions
Field | Description |
---|---|
AudioFormat | oneof: raw_audio or container_audio |
raw_audio | RawAudio Audio without container. |
container_audio | ContainerAudio Audio is wrapped in container. |
RawAudio
Field | Description |
---|---|
audio_encoding | enum AudioEncoding Type of audio encoding. |
sample_rate_hertz | int64 PCM sample rate |
audio_channel_count | int64 PCM channel count. Currently only single channel audio is supported in real-time recognition. |
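For raw PCM, chunk sizes follow directly from the fields above: 16-bit PCM uses 2 bytes per sample, so a chunk covering a given duration is `sample_rate_hertz * audio_channel_count * 2 * chunk_ms / 1000` bytes. A small sketch of that arithmetic (the 100 ms chunk duration is an illustrative choice, not a requirement of the API):

```python
# Byte length of a 16-bit (LINEAR16) PCM chunk covering chunk_ms of audio.
# Single-channel audio is assumed by default, per the note above.

def pcm16_chunk_size(sample_rate_hertz, chunk_ms, audio_channel_count=1):
    """Bytes of 16-bit PCM needed to cover chunk_ms of audio."""
    bytes_per_second = sample_rate_hertz * audio_channel_count * 2
    return bytes_per_second * chunk_ms // 1000

# 100 ms of 8 kHz mono LINEAR16 audio:
size = pcm16_chunk_size(8000, 100)   # 1600 bytes
```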
ContainerAudio
Field | Description |
---|---|
container_audio_type | enum ContainerAudioType Type of audio container. |
TextNormalizationOptions
Field | Description |
---|---|
text_normalization | enum TextNormalization Normalization mode. |
profanity_filter | bool Profanity filter (default: false). |
literature_text | bool Rewrite text in literature style (default: false). |
phone_formatting_mode | enum PhoneFormattingMode Defines the phone formatting mode. |
LanguageRestrictionOptions
Field | Description |
---|---|
restriction_type | enum LanguageRestrictionType Language restriction type. |
language_code[] | string The list of language codes to restrict recognition to when using an auto model. |
EouClassifierOptions
Field | Description |
---|---|
Classifier | oneof: default_classifier or external_classifier Type of EOU classifier. |
default_classifier | DefaultEouClassifier EOU classifier provided by SpeechKit. Default. |
external_classifier | ExternalEouClassifier EOU is enforced by external messages from user. |
DefaultEouClassifier
Field | Description |
---|---|
type | enum EouSensitivity EOU sensitivity. Currently two levels: a faster one with more errors, and a more conservative one (the default). |
max_pause_between_words_hint_ms | int64 Hint for the maximum pause between words. The EOU detector can use this information to distinguish between end of utterance and slow speech (like one |
ExternalEouClassifier
Empty.
RecognitionClassifierOptions
Field | Description |
---|---|
classifiers[] | RecognitionClassifier List of classifiers to use |
RecognitionClassifier
Field | Description |
---|---|
classifier | string Classifier name |
triggers[] | enum TriggerType The types of responses with which the classification results are delivered. |
SpeechAnalysisOptions
Field | Description |
---|---|
enable_speaker_analysis | bool Analyse speech for every speaker |
enable_conversation_analysis | bool Analyse conversation of two speakers |
descriptive_statistics_quantiles[] | double Quantile levels in range (0, 1) for descriptive statistics |
SpeakerLabelingOptions
Field | Description |
---|---|
speaker_labeling | enum SpeakerLabeling Specifies whether speaker labeling is performed. Default is SPEAKER_LABELING_DISABLED. |
AudioChunk
Field | Description |
---|---|
data | bytes Bytes with audio data. |
SilenceChunk
Field | Description |
---|---|
duration_ms | int64 Duration of silence chunk in ms. |
Eou
Empty.
StreamingResponse
Field | Description |
---|---|
session_uuid | SessionUuid Session identifier. |
audio_cursors | AudioCursors Progress of the streaming recognition session: how much data the server has received, final and partial times, etc. |
response_wall_time_ms | int64 Server-side wall-clock time at which the server wrote the results to the stream. |
Event | oneof: partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis or conversation_analysis |
partial | AlternativeUpdate Partial results; the server sends them regularly once enough audio data has been received from the user. This is the current text estimation from final_time_ms to partial_time_ms and may change as new data arrives. |
final | AlternativeUpdate Final results; the recognition is now fixed up to final_time_ms. For now, final is sent only if the EOU event was triggered. This could change in future releases. |
eou_update | EouUpdate After the EOU classifier triggers, the server sends a final message followed by an EouUpdate with the time of the EOU. Before the eou_update, the server sends a final with the same time; there can be several finals before an eou_update. |
final_refinement | FinalRefinement For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Final normalization introduces additional latency. |
status_code | StatusCode Status messages, sent by the server at a fixed interval (keep-alive). |
classifier_update | RecognitionClassifierUpdate Result of the triggered classifier. |
speaker_analysis | SpeakerAnalysis Speech statistics for every speaker. |
conversation_analysis | ConversationAnalysis Conversation statistics. |
channel_tag | string Tag to distinguish audio channels. |
SessionUuid
Field | Description |
---|---|
uuid | string Internal session identifier. |
user_request_id | string User session identifier. |
AudioCursors
Field | Description |
---|---|
received_data_ms | int64 Amount of audio data received by the server, in ms. This cursor advances after each audio chunk the server receives. |
reset_time_ms | int64 Input stream reset data. |
partial_time_ms | int64 How much audio has been processed, including trimmed silences. This cursor advances once the server has received enough data to update the recognition results. |
final_time_ms | int64 Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to final_time_ms will not change anymore. Usually this event is followed by EOU detection (but this could change in the future). |
final_index | int64 Index of the last final the server sent. Incremented after each new final. |
eou_time_ms | int64 Estimated time of the EOU. This cursor is updated after each new EOU is sent. For the external classifier, it equals received_data_ms at the moment the EOU event arrives; for the internal classifier, it is an estimate with the same guarantees as word timings. |
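Read together, the cursor descriptions imply an ordering invariant: the server cannot have finalized more audio than it has partially processed, nor processed more than it has received. A sketch of that invariant, as a reading of the field descriptions above rather than an official validation routine:

```python
# Cursor ordering implied by the AudioCursors descriptions:
# final_time_ms <= partial_time_ms <= received_data_ms.

def cursors_consistent(c):
    """Check the monotone ordering of the three main time cursors."""
    return (0 <= c["final_time_ms"]
            <= c["partial_time_ms"]
            <= c["received_data_ms"])

ok = cursors_consistent(
    {"received_data_ms": 5200, "partial_time_ms": 4800, "final_time_ms": 3000})
```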
AlternativeUpdate
Field | Description |
---|---|
alternatives[] | Alternative List of hypotheses for the time frame. |
channel_tag | string |
Alternative
Field | Description |
---|---|
words[] | Word Words in time frame. |
text | string Text in time frame. |
start_time_ms | int64 Start of time frame. |
end_time_ms | int64 End of time frame. |
confidence | double The hypothesis confidence. Currently not used. |
languages[] | LanguageEstimation Distribution over possible languages. |
Word
Field | Description |
---|---|
text | string Word text. |
start_time_ms | int64 Estimation of word start time in ms. |
end_time_ms | int64 Estimation of word end time in ms. |
LanguageEstimation
Field | Description |
---|---|
language_code | string Language code in ISO 639-1 format. |
probability | double Estimation of language probability. |
EouUpdate
Field | Description |
---|---|
time_ms | int64 EOU estimated time. |
FinalRefinement
Field | Description |
---|---|
final_index | int64 Index of the final for which the server sends additional information. |
Type | oneof: normalized_text Type of refinement. |
normalized_text | AlternativeUpdate Normalized text instead of the raw one. |
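Since finals and their refinements arrive as separate StreamingResponse events, a client typically joins them on final_index (which, for finals, is carried in audio_cursors). A sketch of overlaying normalized text onto raw finals; plain dicts stand in for the generated response classes, and the sample texts are illustrative:

```python
# Join final and final_refinement events on final_index, preferring the
# normalized text once the refinement arrives.

def collect_transcripts(responses):
    """Map final_index -> best available text (normalized once refined)."""
    texts = {}
    for resp in responses:
        if "final" in resp:
            idx = resp["audio_cursors"]["final_index"]
            texts[idx] = resp["final"]["alternatives"][0]["text"]
        elif "final_refinement" in resp:
            ref = resp["final_refinement"]
            alt = ref["normalized_text"]["alternatives"][0]
            texts[ref["final_index"]] = alt["text"]
    return texts

responses = [
    {"audio_cursors": {"final_index": 0},
     "final": {"alternatives": [{"text": "twenty five"}]}},
    {"final_refinement": {"final_index": 0,
     "normalized_text": {"alternatives": [{"text": "25"}]}}},
]
transcripts = collect_transcripts(responses)   # {0: "25"}
```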
StatusCode
Field | Description |
---|---|
code_type | enum CodeType Code type. |
message | string Human-readable message. |
RecognitionClassifierUpdate
Field | Description |
---|---|
window_type | enum WindowType Response window type. |
start_time_ms | int64 Start time of the audio segment used for classification |
end_time_ms | int64 End time of the audio segment used for classification |
classifier_result | RecognitionClassifierResult Result for dictionary-based classifier |
RecognitionClassifierResult
Field | Description |
---|---|
classifier | string Name of the triggered classifier |
highlights[] | PhraseHighlight List of highlights, i.e. parts of phrase that determine the result of the classification |
labels[] | RecognitionClassifierLabel Classifier predictions |
PhraseHighlight
Field | Description |
---|---|
text | string Text transcription of the highlighted audio segment |
start_time_ms | int64 Start time of the highlighted audio segment |
end_time_ms | int64 End time of the highlighted audio segment |
RecognitionClassifierLabel
Field | Description |
---|---|
label | string The label of the class predicted by the classifier |
confidence | double The prediction confidence |
SpeakerAnalysis
Field | Description |
---|---|
speaker_tag | string Speaker tag |
window_type | enum WindowType Response window type. |
speech_boundaries | AudioSegmentBoundaries Audio segment boundaries |
total_speech_ms | int64 Total speech duration |
speech_ratio | double Speech ratio within audio segment |
total_silence_ms | int64 Total silence duration |
silence_ratio | double Silence ratio within audio segment |
words_count | int64 Number of words in recognized speech |
letters_count | int64 Number of letters in recognized speech |
words_per_second | DescriptiveStatistics Descriptive statistics for words per second distribution |
letters_per_second | DescriptiveStatistics Descriptive statistics for letters per second distribution |
words_per_utterance | DescriptiveStatistics Descriptive statistics for words per utterance distribution |
letters_per_utterance | DescriptiveStatistics Descriptive statistics for letters per utterance distribution |
utterance_count | int64 Number of utterances |
utterance_duration_estimation | DescriptiveStatistics Descriptive statistics for utterance duration distribution |
AudioSegmentBoundaries
Field | Description |
---|---|
start_time_ms | int64 Audio segment start time |
end_time_ms | int64 Audio segment end time |
DescriptiveStatistics
Field | Description |
---|---|
min | double Minimum observed value |
max | double Maximum observed value |
mean | double Estimated mean of distribution |
std | double Estimated standard deviation of distribution |
quantiles[] | Quantile List of evaluated quantiles |
Quantile
Field | Description |
---|---|
level | double Quantile level in range (0, 1) |
value | double Quantile value |
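The DescriptiveStatistics and Quantile shapes can be reproduced locally for a sample such as a words-per-second series; the quantile levels correspond to the descriptive_statistics_quantiles[] request option. This is a local sketch using linear interpolation between closest ranks, not the server's exact estimator:

```python
import statistics

def quantile(sorted_vals, level):
    """Quantile at `level` in (0, 1), linearly interpolated between ranks."""
    idx = level * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def describe(values, levels=(0.5,)):
    """Build a DescriptiveStatistics-shaped dict for a sample."""
    vals = sorted(values)
    return {
        "min": vals[0],
        "max": vals[-1],
        "mean": statistics.mean(vals),
        "std": statistics.pstdev(vals),
        "quantiles": [{"level": l, "value": quantile(vals, l)} for l in levels],
    }

# e.g. a small words-per-second sample:
stats = describe([2.0, 3.0, 4.0], levels=(0.5,))
```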
ConversationAnalysis
Field | Description |
---|---|
conversation_boundaries | AudioSegmentBoundaries Audio segment boundaries |
total_simultaneous_silence_duration_ms | int64 Total simultaneous silence duration |
total_simultaneous_silence_ratio | double Simultaneous silence ratio within audio segment |
simultaneous_silence_duration_estimation | DescriptiveStatistics Descriptive statistics for simultaneous silence duration distribution |
total_simultaneous_speech_duration_ms | int64 Total simultaneous speech duration |
total_simultaneous_speech_ratio | double Simultaneous speech ratio within audio segment |
simultaneous_speech_duration_estimation | DescriptiveStatistics Descriptive statistics for simultaneous speech duration distribution |
speaker_interrupts[] | InterruptsEvaluation Interrupts description for every speaker |
total_speech_duration_ms | int64 Total speech duration, including both simultaneous and separate speech |
total_speech_ratio | double Total speech ratio within audio segment |
InterruptsEvaluation
Field | Description |
---|---|
speaker_tag | string Speaker tag |
interrupts_count | int64 Number of interrupts made by the speaker |
interrupts_duration_ms | int64 Total duration of all interrupts |
interrupts[] | AudioSegmentBoundaries Boundaries for every interrupt |