SpeechKit Hybrid Recognition API, gRPC: Recognizer
Updated on January 25, 2024
- Calls Recognizer
- RecognizeStreaming
- StreamingRequest
- StreamingOptions
- RecognitionModelOptions
- AudioFormatOptions
- RawAudio
- ContainerAudio
- TextNormalizationOptions
- LanguageRestrictionOptions
- EouClassifierOptions
- DefaultEouClassifier
- ExternalEouClassifier
- RecognitionClassifierOptions
- RecognitionClassifier
- SpeechAnalysisOptions
- SpeakerLabelingOptions
- AudioChunk
- SilenceChunk
- Eou
- StreamingResponse
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
A set of methods for voice recognition.
| Call | Description |
|---|---|
| RecognizeStreaming | Expects audio in real-time |
Calls Recognizer
RecognizeStreaming
Expects audio in real-time
rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
StreamingRequest
| Field | Description |
|---|---|
| Event | oneof: session_options, chunk, silence_chunk or eou |
| session_options | StreamingOptions Session options. Must be the first message from the user. |
| chunk | AudioChunk Chunk with audio data. |
| silence_chunk | SilenceChunk Chunk with silence. |
| eou | Eou Request to end current utterance. Works only with external EOU detector. |
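The ordering above (a `session_options` message first, then audio chunks) can be sketched as a request generator. This is a minimal sketch using plain dicts as stand-ins for the generated protobuf messages; the `"general"` model value is taken from the RecognitionModelOptions table, and the chunk size is an arbitrary choice.

```python
def streaming_requests(audio: bytes, chunk_size: int = 4096):
    """Yield request payloads in the order the API expects:
    session options first, then audio chunks."""
    # First message: session options.
    yield {"session_options": {
        "recognition_model": {"model": "general"},
    }}
    # Subsequent messages: audio data, split into chunks.
    for offset in range(0, len(audio), chunk_size):
        yield {"chunk": {"data": audio[offset:offset + chunk_size]}}
```

In a real client, each yielded payload would be a `StreamingRequest`, and the generator would be passed to the `RecognizeStreaming` stub call.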
StreamingOptions
| Field | Description |
|---|---|
| recognition_model | RecognitionModelOptions Configuration for speech recognition model. |
| eou_classifier | EouClassifierOptions Configuration for end of utterance detection model. |
| recognition_classifier | RecognitionClassifierOptions Configuration for classifiers over speech recognition. |
| speech_analysis | SpeechAnalysisOptions Configuration for speech analysis over speech recognition. |
| speaker_labeling | SpeakerLabelingOptions Configuration for speaker labeling |
RecognitionModelOptions
| Field | Description |
|---|---|
| model | string Sets the recognition model for the cloud version of SpeechKit. Possible values: 'general', 'general:rc', 'general:deprecated'. The model is ignored for SpeechKit Hybrid. |
| audio_format | AudioFormatOptions Specifies the input audio format. |
| text_normalization | TextNormalizationOptions Text normalization options. |
| language_restriction | LanguageRestrictionOptions Possible languages in audio. |
| audio_processing_type | enum AudioProcessingType How to process the audio data (in real time, after all data is received, etc.). Default is REAL_TIME. |
AudioFormatOptions
| Field | Description |
|---|---|
| AudioFormat | oneof: raw_audio or container_audio |
| raw_audio | RawAudio Audio without container. |
| container_audio | ContainerAudio Audio is wrapped in container. |
RawAudio
| Field | Description |
|---|---|
| audio_encoding | enum AudioEncoding Type of audio encoding. |
| sample_rate_hertz | int64 PCM sample rate |
| audio_channel_count | int64 PCM channel count. Currently only single channel audio is supported in real-time recognition. |
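For raw PCM, the duration of an audio buffer follows directly from the sample rate, channel count, and sample width. A minimal sketch (the 16 kHz mono defaults and the 2-byte sample width are assumptions for illustration, matching common LINEAR16 audio):

```python
def pcm_duration_ms(num_bytes: int,
                    sample_rate_hertz: int = 16000,
                    audio_channel_count: int = 1,
                    bytes_per_sample: int = 2) -> int:
    """Duration in ms of a raw PCM buffer of num_bytes."""
    bytes_per_second = sample_rate_hertz * audio_channel_count * bytes_per_sample
    return num_bytes * 1000 // bytes_per_second
```

This is useful for sizing chunks, e.g. sending roughly 200 ms of audio per `AudioChunk`.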
ContainerAudio
| Field | Description |
|---|---|
| container_audio_type | enum ContainerAudioType Type of audio container. |
TextNormalizationOptions
| Field | Description |
|---|---|
| text_normalization | enum TextNormalization Normalization. |
| profanity_filter | bool Profanity filter (default: false). |
| literature_text | bool Rewrite text in literature style (default: false). |
| phone_formatting_mode | enum PhoneFormattingMode Defines the phone formatting mode. |
LanguageRestrictionOptions
| Field | Description |
|---|---|
| restriction_type | enum LanguageRestrictionType Language restriction type. |
| language_code[] | string The list of language codes to restrict recognition in the case of an auto model |
EouClassifierOptions
| Field | Description |
|---|---|
| Classifier | oneof: default_classifier or external_classifier. Type of EOU classifier. |
| default_classifier | DefaultEouClassifier EOU classifier provided by SpeechKit. Default. |
| external_classifier | ExternalEouClassifier EOU is enforced by external messages from user. |
DefaultEouClassifier
| Field | Description |
|---|---|
| type | enum EouSensitivity EOU sensitivity. Two levels are currently supported: a faster one with more errors and a more conservative one (the default). |
| max_pause_between_words_hint_ms | int64 Hint for the maximum pause between words. The EOU detector can use this information to distinguish between an end of utterance and slow speech. |
ExternalEouClassifier
Empty.
RecognitionClassifierOptions
| Field | Description |
|---|---|
| classifiers[] | RecognitionClassifier List of classifiers to use |
RecognitionClassifier
| Field | Description |
|---|---|
| classifier | string Classifier name |
| triggers[] | enum TriggerType The response types for which classification results are returned. |
SpeechAnalysisOptions
| Field | Description |
|---|---|
| enable_speaker_analysis | bool Analyse speech for every speaker |
| enable_conversation_analysis | bool Analyse conversation of two speakers |
| descriptive_statistics_quantiles[] | double Quantile levels in range (0, 1) for descriptive statistics |
SpeakerLabelingOptions
| Field | Description |
|---|---|
| speaker_labeling | enum SpeakerLabeling Specifies the execution of speaker labeling. Default is SPEAKER_LABELING_DISABLED. |
AudioChunk
| Field | Description |
|---|---|
| data | bytes Bytes with audio data. |
SilenceChunk
| Field | Description |
|---|---|
| duration_ms | int64 Duration of silence chunk in ms. |
Eou
Empty.
StreamingResponse
| Field | Description |
|---|---|
| session_uuid | SessionUuid Session identifier |
| audio_cursors | AudioCursors Progress of the streaming recognition session: how much data has been received, final and partial times, etc. |
| response_wall_time_ms | int64 Server-side wall-clock time at which the server wrote the results to the stream. |
| Event | oneof: partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis or conversation_analysis |
| partial | AlternativeUpdate Partial results; the server sends them regularly once enough audio data has been received from the user. This is the current text estimate from final_time_ms to partial_time_ms and may change as new data arrives. |
| final | AlternativeUpdate Final results; recognition is now fixed up to final_time_ms. For now, a final is sent only when the EOU event is triggered. This may change in future releases. |
| eou_update | EouUpdate After the EOU classifier triggers, the server sends an EouUpdate with the estimated EOU time. Before each eou_update, the server sends a final with the same time; there may be several finals before an eou_update. |
| final_refinement | FinalRefinement For each final, if normalization is enabled, the server sends the normalized text (or the result of other advanced post-processing). Final normalization introduces additional latency. |
| status_code | StatusCode Status messages, sent by the server at a fixed interval (keep-alive). |
| classifier_update | RecognitionClassifierUpdate Result of the triggered classifier |
| speaker_analysis | SpeakerAnalysis Speech statistics for every speaker |
| conversation_analysis | ConversationAnalysis Conversation statistics |
| channel_tag | string Tag to distinguish audio channels. |
SessionUuid
| Field | Description |
|---|---|
| uuid | string Internal session identifier. |
| user_request_id | string User session identifier. |
AudioCursors
| Field | Description |
|---|---|
| received_data_ms | int64 Amount of audio data the server has received, in ms. This cursor moves after each audio chunk is received by the server. |
| reset_time_ms | int64 Input stream reset data. |
| partial_time_ms | int64 How much audio has been processed, including trimmed silence. This cursor moves after the server has received enough data to update the recognition results. |
| final_time_ms | int64 Time of the last final. This cursor moves when the server decides that recognition from the start of the audio up to final_time_ms will not change anymore; usually this event is followed by EOU detection (but this may change in the future). |
| final_index | int64 Index of the last final the server sent. Incremented after each new final. |
| eou_time_ms | int64 Estimated time of the EOU. This cursor is updated after each new EOU is sent. For the external classifier, this equals received_data_ms at the moment the EOU event arrives; for the internal classifier, it is a time estimate with the same guarantees as word timings. |
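The cursors are monotone and ordered: final_time_ms never exceeds partial_time_ms, which never exceeds received_data_ms. A client-side mirror of these cursors can be sketched as follows (field names follow the table above; the update methods are illustrative, not part of the API):

```python
class AudioCursorsTracker:
    """Client-side mirror of the server's audio cursors (a sketch)."""

    def __init__(self):
        self.received_data_ms = 0
        self.partial_time_ms = 0
        self.final_time_ms = 0
        self.final_index = 0

    def on_chunk_sent(self, chunk_duration_ms: int):
        # received_data_ms moves after each audio chunk.
        self.received_data_ms += chunk_duration_ms

    def on_final(self, time_ms: int):
        # final_time_ms only moves forward, and never past received data;
        # the index increments after each new final.
        assert self.final_time_ms <= time_ms <= self.received_data_ms
        self.final_time_ms = time_ms
        self.final_index += 1
```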
AlternativeUpdate
| Field | Description |
|---|---|
| alternatives[] | Alternative List of hypotheses for the time frame. |
| channel_tag | string Channel tag. |
Alternative
| Field | Description |
|---|---|
| words[] | Word Words in time frame. |
| text | string Text in time frame. |
| start_time_ms | int64 Start of time frame. |
| end_time_ms | int64 End of time frame. |
| confidence | double The hypothesis confidence. Currently not used. |
| languages[] | LanguageEstimation Distribution over possible languages. |
Word
| Field | Description |
|---|---|
| text | string Word text. |
| start_time_ms | int64 Estimation of word start time in ms. |
| end_time_ms | int64 Estimation of word end time in ms. |
LanguageEstimation
| Field | Description |
|---|---|
| language_code | string Language code in ISO 639-1 format. |
| probability | double Estimation of language probability. |
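Since `languages[]` in Alternative is a distribution over possible languages, a client typically picks the entry with the highest probability. A minimal sketch, using dicts with the field names above in place of the protobuf message:

```python
def most_probable_language(languages: list[dict]) -> str:
    """Pick the language_code with the highest estimated probability."""
    return max(languages, key=lambda lang: lang["probability"])["language_code"]
```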
EouUpdate
| Field | Description |
|---|---|
| time_ms | int64 EOU estimated time. |
FinalRefinement
| Field | Description |
|---|---|
| final_index | int64 Index of final for which server sends additional information. |
| Type | oneof: normalized_text. Type of refinement. |
| normalized_text | AlternativeUpdate Normalized text instead of the raw one. |
StatusCode
| Field | Description |
|---|---|
| code_type | enum CodeType Code type. |
| message | string Human readable message. |
RecognitionClassifierUpdate
| Field | Description |
|---|---|
| window_type | enum WindowType Response window type. |
| start_time_ms | int64 Start time of the audio segment used for classification |
| end_time_ms | int64 End time of the audio segment used for classification |
| classifier_result | RecognitionClassifierResult Result for dictionary-based classifier |
RecognitionClassifierResult
| Field | Description |
|---|---|
| classifier | string Name of the triggered classifier |
| highlights[] | PhraseHighlight List of highlights, i.e. parts of phrase that determine the result of the classification |
| labels[] | RecognitionClassifierLabel Classifier predictions |
PhraseHighlight
| Field | Description |
|---|---|
| text | string Text transcription of the highlighted audio segment |
| start_time_ms | int64 Start time of the highlighted audio segment |
| end_time_ms | int64 End time of the highlighted audio segment |
RecognitionClassifierLabel
| Field | Description |
|---|---|
| label | string The label of the class predicted by the classifier |
| confidence | double The prediction confidence |
SpeakerAnalysis
| Field | Description |
|---|---|
| speaker_tag | string Speaker tag |
| window_type | enum WindowType Response window type. |
| speech_boundaries | AudioSegmentBoundaries Audio segment boundaries |
| total_speech_ms | int64 Total speech duration |
| speech_ratio | double Speech ratio within audio segment |
| total_silence_ms | int64 Total silence duration |
| silence_ratio | double Silence ratio within audio segment |
| words_count | int64 Number of words in recognized speech |
| letters_count | int64 Number of letters in recognized speech |
| words_per_second | DescriptiveStatistics Descriptive statistics for words per second distribution |
| letters_per_second | DescriptiveStatistics Descriptive statistics for letters per second distribution |
| words_per_utterance | DescriptiveStatistics Descriptive statistics for words per utterance distribution |
| letters_per_utterance | DescriptiveStatistics Descriptive statistics for letters per utterance distribution |
| utterance_count | int64 Number of utterances |
| utterance_duration_estimation | DescriptiveStatistics Descriptive statistics for utterance duration distribution |
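The duration and ratio fields above are related: the speech and silence ratios are the corresponding durations divided by the length of the analysed segment. A sketch of that relationship, assuming per-speaker speech segments expressed as AudioSegmentBoundaries-style dicts (the helper itself is illustrative, not part of the API):

```python
def speaker_ratios(speech_segments: list[dict], boundaries: dict) -> dict:
    """Derive total_speech_ms, speech_ratio, total_silence_ms, and
    silence_ratio from speech segments within the analysed audio segment."""
    total = boundaries["end_time_ms"] - boundaries["start_time_ms"]
    speech = sum(s["end_time_ms"] - s["start_time_ms"] for s in speech_segments)
    silence = total - speech
    return {
        "total_speech_ms": speech,
        "speech_ratio": speech / total,
        "total_silence_ms": silence,
        "silence_ratio": silence / total,
    }
```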
AudioSegmentBoundaries
| Field | Description |
|---|---|
| start_time_ms | int64 Audio segment start time |
| end_time_ms | int64 Audio segment end time |
DescriptiveStatistics
| Field | Description |
|---|---|
| min | double Minimum observed value |
| max | double Maximum observed value |
| mean | double Estimated mean of distribution |
| std | double Estimated standard deviation of distribution |
| quantiles[] | Quantile List of evaluated quantiles |
Quantile
| Field | Description |
|---|---|
| level | double Quantile level in range (0, 1) |
| value | double Quantile value |
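The DescriptiveStatistics fields can be reproduced from a sample of observations. A minimal sketch; the linear-interpolation quantile estimator and the use of population standard deviation are assumptions, since the service does not document its exact estimators:

```python
import statistics


def descriptive_statistics(values: list[float], quantile_levels: list[float]) -> dict:
    """Compute DescriptiveStatistics-style fields for a list of observations."""
    svals = sorted(values)

    def quantile(level: float) -> float:
        # Linear interpolation between order statistics (an assumption).
        pos = level * (len(svals) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(svals) - 1)
        return svals[lo] + (svals[hi] - svals[lo]) * (pos - lo)

    return {
        "min": svals[0],
        "max": svals[-1],
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "quantiles": [{"level": q, "value": quantile(q)} for q in quantile_levels],
    }
```

Quantile levels passed in `descriptive_statistics_quantiles[]` of SpeechAnalysisOptions must lie strictly between 0 and 1.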
ConversationAnalysis
| Field | Description |
|---|---|
| conversation_boundaries | AudioSegmentBoundaries Audio segment boundaries |
| total_simultaneous_silence_duration_ms | int64 Total simultaneous silence duration |
| total_simultaneous_silence_ratio | double Simultaneous silence ratio within audio segment |
| simultaneous_silence_duration_estimation | DescriptiveStatistics Descriptive statistics for simultaneous silence duration distribution |
| total_simultaneous_speech_duration_ms | int64 Total simultaneous speech duration |
| total_simultaneous_speech_ratio | double Simultaneous speech ratio within audio segment |
| simultaneous_speech_duration_estimation | DescriptiveStatistics Descriptive statistics for simultaneous speech duration distribution |
| speaker_interrupts[] | InterruptsEvaluation Interrupts description for every speaker |
| total_speech_duration_ms | int64 Total speech duration, including both simultaneous and separate speech |
| total_speech_ratio | double Total speech ratio within audio segment |
InterruptsEvaluation
| Field | Description |
|---|---|
| speaker_tag | string Speaker tag |
| interrupts_count | int64 Number of interrupts made by the speaker |
| interrupts_duration_ms | int64 Total duration of all interrupts |
| interrupts[] | AudioSegmentBoundaries Boundaries for every interrupt |
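The InterruptsEvaluation fields can be illustrated with a simple overlap check. The exact definition of an interrupt is not documented; this sketch assumes an interrupt occurs when the speaker starts talking while the other speaker is already talking, and that the interrupt lasts while both speak simultaneously:

```python
def evaluate_interrupts(speaker_segments: list[dict], other_segments: list[dict]) -> dict:
    """Find intervals where the speaker starts while the other speaker is
    already talking, counted as interrupts made by the speaker (a sketch)."""
    found = []
    for seg in speaker_segments:
        for other in other_segments:
            # The speaker started inside the other speaker's segment.
            if other["start_time_ms"] < seg["start_time_ms"] < other["end_time_ms"]:
                end = min(seg["end_time_ms"], other["end_time_ms"])
                found.append({"start_time_ms": seg["start_time_ms"],
                              "end_time_ms": end})
    return {
        "interrupts_count": len(found),
        "interrupts_duration_ms": sum(
            i["end_time_ms"] - i["start_time_ms"] for i in found),
        "interrupts": found,
    }
```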