SpeechKit Recognition API v3, gRPC: Recognizer.RecognizeStreaming
- gRPC request
- StreamingRequest
- StreamingOptions
- RecognitionModelOptions
- AudioFormatOptions
- RawAudio
- ContainerAudio
- TextNormalizationOptions
- LanguageRestrictionOptions
- EouClassifierOptions
- DefaultEouClassifier
- ExternalEouClassifier
- RecognitionClassifierOptions
- RecognitionClassifier
- SpeechAnalysisOptions
- SpeakerLabelingOptions
- AudioChunk
- SilenceChunk
- Eou
- StreamingResponse
- SessionUuid
- AudioCursors
- AlternativeUpdate
- Alternative
- Word
- LanguageEstimation
- EouUpdate
- FinalRefinement
- StatusCode
- RecognitionClassifierUpdate
- RecognitionClassifierResult
- PhraseHighlight
- RecognitionClassifierLabel
- SpeakerAnalysis
- AudioSegmentBoundaries
- DescriptiveStatistics
- Quantile
- ConversationAnalysis
- InterruptsEvaluation
Expects audio in real time.
gRPC request
rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
StreamingRequest
{
// Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`
"sessionOptions": {
"recognitionModel": {
"model": "string",
"audioFormat": {
// Includes only one of the fields `rawAudio`, `containerAudio`
"rawAudio": {
"audioEncoding": "AudioEncoding",
"sampleRateHertz": "int64",
"audioChannelCount": "int64"
},
"containerAudio": {
"containerAudioType": "ContainerAudioType"
}
// end of the list of possible fields
},
"textNormalization": {
"textNormalization": "TextNormalization",
"profanityFilter": "bool",
"literatureText": "bool",
"phoneFormattingMode": "PhoneFormattingMode"
},
"languageRestriction": {
"restrictionType": "LanguageRestrictionType",
"languageCode": [
"string"
]
},
"audioProcessingType": "AudioProcessingType"
},
"eouClassifier": {
// Includes only one of the fields `defaultClassifier`, `externalClassifier`
"defaultClassifier": {
"type": "EouSensitivity",
"maxPauseBetweenWordsHintMs": "int64"
},
"externalClassifier": "ExternalEouClassifier"
// end of the list of possible fields
},
"recognitionClassifier": {
"classifiers": [
{
"classifier": "string",
"triggers": [
"TriggerType"
]
}
]
},
"speechAnalysis": {
"enableSpeakerAnalysis": "bool",
"enableConversationAnalysis": "bool",
"descriptiveStatisticsQuantiles": [
"double"
]
},
"speakerLabeling": {
"speakerLabeling": "SpeakerLabeling"
}
},
"chunk": {
"data": "bytes"
},
"silenceChunk": {
"durationMs": "int64"
},
"eou": "Eou"
// end of the list of possible fields
}
Streaming audio request.
Events are control messages from the user. The first message must contain the session options; the following messages are audio data chunks or control messages.
Field | Description
--- | ---
sessionOptions | Session options. Should be the first message from the user. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
chunk | Chunk with audio data. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
silenceChunk | Chunk with silence. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
eou | Request to end the current utterance. Works only with the external EOU detector. Includes only one of the fields `sessionOptions`, `chunk`, `silenceChunk`, `eou`.
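For orientation, here is a minimal sketch of the client side of this call in Python, assuming the gRPC stubs generated from the stt/v3 protos are importable as `stt_pb2` and `stt_service_pb2_grpc` and that an API key is used for authorization. The module paths, endpoint, chunk size, and the `LINEAR16_PCM` enum value are typical-setup assumptions, not part of this reference.

```python
import grpc

# Assumed import paths for the generated stt/v3 stubs.
from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc

CHUNK_SIZE = 4000  # bytes of audio per request message (illustrative)


def request_iterator(audio_path):
    # The first message must carry the session options.
    yield stt_pb2.StreamingRequest(
        session_options=stt_pb2.StreamingOptions(
            recognition_model=stt_pb2.RecognitionModelOptions(
                model="general",
                audio_format=stt_pb2.AudioFormatOptions(
                    raw_audio=stt_pb2.RawAudio(
                        audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,  # assumed enum value
                        sample_rate_hertz=8000,
                        audio_channel_count=1,
                    )
                ),
            )
        )
    )
    # Every following message is an audio chunk (or another control message).
    with open(audio_path, "rb") as audio:
        while data := audio.read(CHUNK_SIZE):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))


def recognize_streaming(audio_path, api_key):
    channel = grpc.secure_channel(
        "stt.api.cloud.yandex.net:443", grpc.ssl_channel_credentials()
    )
    stub = stt_service_pb2_grpc.RecognizerStub(channel)
    # Bidirectional streaming: pass a request iterator, receive a response iterator.
    return stub.RecognizeStreaming(
        request_iterator(audio_path),
        metadata=(("authorization", f"Api-Key {api_key}"),),
    )
```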
StreamingOptions
Field | Description
--- | ---
recognitionModel | Configuration for speech recognition model.
eouClassifier | Configuration for end of utterance detection model.
recognitionClassifier | Configuration for classifiers over speech recognition.
speechAnalysis | Configuration for speech analysis over speech recognition.
speakerLabeling | Configuration for speaker labeling.
RecognitionModelOptions
Field | Description
--- | ---
model | string Sets the recognition model for the cloud version of SpeechKit. Possible values: 'general', 'general:rc', 'general:deprecated'.
audioFormat | Specifies the format of the input audio.
textNormalization | Text normalization options.
languageRestriction | Possible languages in the audio.
audioProcessingType | enum AudioProcessingType How to process the audio data (in real time, after all data is received, etc.). Default is REAL_TIME.
AudioFormatOptions
Audio format options.
Field | Description
--- | ---
rawAudio | Audio without a container. Includes only one of the fields `rawAudio`, `containerAudio`.
containerAudio | Audio wrapped in a container. Includes only one of the fields `rawAudio`, `containerAudio`.
RawAudio
RAW Audio format spec (no container to infer type). Used in AudioFormat options.
Field | Description
--- | ---
audioEncoding | enum AudioEncoding Type of audio encoding.
sampleRateHertz | int64 PCM sample rate.
audioChannelCount | int64 PCM channel count. Currently only single-channel audio is supported in real-time recognition.
ContainerAudio
Audio with fixed type in container. Used in AudioFormat options.
Field | Description
--- | ---
containerAudioType | enum ContainerAudioType Type of audio container.
TextNormalizationOptions
Text normalization options.
Field | Description
--- | ---
textNormalization | enum TextNormalization Text normalization mode.
profanityFilter | bool Profanity filter (default: false).
literatureText | bool Rewrite text in literature style (default: false).
phoneFormattingMode | enum PhoneFormattingMode Defines the phone formatting mode.
LanguageRestrictionOptions
Type of restriction for the list of languages expected in the incoming speech stream.
Field | Description
--- | ---
restrictionType | enum LanguageRestrictionType Language restriction type.
languageCode[] | string The list of language codes to restrict recognition in the case of an auto model.
EouClassifierOptions
Field | Description
--- | ---
defaultClassifier | EOU classifier provided by SpeechKit (default). Includes only one of the fields `defaultClassifier`, `externalClassifier` (the type of EOU classifier).
externalClassifier | EOU is enforced by external messages from the user. Includes only one of the fields `defaultClassifier`, `externalClassifier` (the type of EOU classifier).
DefaultEouClassifier
Field | Description
--- | ---
type | enum EouSensitivity EOU sensitivity. Currently two levels: faster with more errors, and more conservative (the default).
maxPauseBetweenWordsHintMs | int64 Hint for the maximum pause between words. The EOU detector can use this information to distinguish between end of utterance and slow speech.
ExternalEouClassifier
EOU is provided by the user.
This message has no fields.
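To make the two modes concrete, the options might be built like this. It is a sketch reusing the `stt_pb2` stubs from the request example; the `HIGH` sensitivity value and the pause hint are illustrative assumptions.

```python
# Built-in classifier: the server decides where utterances end.
eou_default = stt_pb2.EouClassifierOptions(
    default_classifier=stt_pb2.DefaultEouClassifier(
        type=stt_pb2.DefaultEouClassifier.HIGH,   # assumed EouSensitivity value
        max_pause_between_words_hint_ms=1500,     # illustrative pause hint
    )
)

# External classifier: the client decides and sends explicit Eou messages.
eou_external = stt_pb2.EouClassifierOptions(
    external_classifier=stt_pb2.ExternalEouClassifier()
)
```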
RecognitionClassifierOptions
Field | Description
--- | ---
classifiers[] | List of classifiers to use.
RecognitionClassifier
Field | Description
--- | ---
classifier | string Classifier name.
triggers[] | enum TriggerType The types of responses that will carry the classification results.
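As an illustration, attaching one classifier and choosing which response types carry its results could look like the sketch below; the classifier name and the `ON_FINAL` trigger value are assumptions, and the `stt_pb2` stubs are reused from the request example.

```python
classifier_options = stt_pb2.RecognitionClassifierOptions(
    classifiers=[
        stt_pb2.RecognitionClassifier(
            classifier="formal_greeting",                       # assumed classifier name
            triggers=[stt_pb2.RecognitionClassifier.ON_FINAL],  # assumed TriggerType value
        )
    ]
)
```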
SpeechAnalysisOptions
Field | Description
--- | ---
enableSpeakerAnalysis | bool Analyse speech for every speaker.
enableConversationAnalysis | bool Analyse the conversation of two speakers.
descriptiveStatisticsQuantiles[] | double Quantile levels in the range (0, 1) for descriptive statistics.
SpeakerLabelingOptions
Field | Description
--- | ---
speakerLabeling | enum SpeakerLabeling Specifies the execution of speaker labeling. Default is SPEAKER_LABELING_DISABLED.
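A short sketch of enabling both analyses and speaker labeling follows; the quantile levels are illustrative, and the `SPEAKER_LABELING_ENABLED` value is an assumption mirroring the documented `SPEAKER_LABELING_DISABLED` default.

```python
speech_analysis = stt_pb2.SpeechAnalysisOptions(
    enable_speaker_analysis=True,
    enable_conversation_analysis=True,
    descriptive_statistics_quantiles=[0.5, 0.9, 0.99],  # levels must lie in (0, 1)
)

speaker_labeling = stt_pb2.SpeakerLabelingOptions(
    speaker_labeling=stt_pb2.SpeakerLabelingOptions.SPEAKER_LABELING_ENABLED,  # assumed value
)
```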
AudioChunk
Data chunk with audio.
Field | Description
--- | ---
data | bytes Bytes with audio data.
SilenceChunk
Data chunk with silence.
Field | Description
--- | ---
durationMs | int64 Duration of the silence chunk in ms.
Eou
Forces EOU.
This message has no fields.
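When the external EOU classifier is selected, the client interleaves silence reports and explicit end-of-utterance requests with its audio chunks. A sketch of the two control messages, reusing the `stt_pb2` stubs from the request example (the silence duration is illustrative):

```python
# Report 500 ms of silence instead of sending empty audio.
silence_message = stt_pb2.StreamingRequest(
    silence_chunk=stt_pb2.SilenceChunk(duration_ms=500)
)

# Force the end of the current utterance; honored only with the external EOU classifier.
force_eou_message = stt_pb2.StreamingRequest(eou=stt_pb2.Eou())
```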
StreamingResponse
{
"sessionUuid": {
"uuid": "string",
"userRequestId": "string"
},
"audioCursors": {
"receivedDataMs": "int64",
"resetTimeMs": "int64",
"partialTimeMs": "int64",
"finalTimeMs": "int64",
"finalIndex": "int64",
"eouTimeMs": "int64"
},
"responseWallTimeMs": "int64",
// Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`
"partial": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64",
"confidence": "double",
"languages": [
{
"languageCode": "string",
"probability": "double"
}
]
}
],
"channelTag": "string"
},
"final": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64",
"confidence": "double",
"languages": [
{
"languageCode": "string",
"probability": "double"
}
]
}
],
"channelTag": "string"
},
"eouUpdate": {
"timeMs": "int64"
},
"finalRefinement": {
"finalIndex": "int64",
// Includes only one of the fields `normalizedText`
"normalizedText": {
"alternatives": [
{
"words": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64",
"confidence": "double",
"languages": [
{
"languageCode": "string",
"probability": "double"
}
]
}
],
"channelTag": "string"
}
// end of the list of possible fields
},
"statusCode": {
"codeType": "CodeType",
"message": "string"
},
"classifierUpdate": {
"windowType": "WindowType",
"startTimeMs": "int64",
"endTimeMs": "int64",
"classifierResult": {
"classifier": "string",
"highlights": [
{
"text": "string",
"startTimeMs": "int64",
"endTimeMs": "int64"
}
],
"labels": [
{
"label": "string",
"confidence": "double"
}
]
}
},
"speakerAnalysis": {
"speakerTag": "string",
"windowType": "WindowType",
"speechBoundaries": {
"startTimeMs": "int64",
"endTimeMs": "int64"
},
"totalSpeechMs": "int64",
"speechRatio": "double",
"totalSilenceMs": "int64",
"silenceRatio": "double",
"wordsCount": "int64",
"lettersCount": "int64",
"wordsPerSecond": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"lettersPerSecond": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"wordsPerUtterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"lettersPerUtterance": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"utteranceCount": "int64",
"utteranceDurationEstimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
}
},
"conversationAnalysis": {
"conversationBoundaries": {
"startTimeMs": "int64",
"endTimeMs": "int64"
},
"totalSimultaneousSilenceDurationMs": "int64",
"totalSimultaneousSilenceRatio": "double",
"simultaneousSilenceDurationEstimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"totalSimultaneousSpeechDurationMs": "int64",
"totalSimultaneousSpeechRatio": "double",
"simultaneousSpeechDurationEstimation": {
"min": "double",
"max": "double",
"mean": "double",
"std": "double",
"quantiles": [
{
"level": "double",
"value": "double"
}
]
},
"speakerInterrupts": [
{
"speakerTag": "string",
"interruptsCount": "int64",
"interruptsDurationMs": "int64",
"interrupts": [
{
"startTimeMs": "int64",
"endTimeMs": "int64"
}
]
}
],
"totalSpeechDurationMs": "int64",
"totalSpeechRatio": "double"
},
// end of the list of possible fields
"channelTag": "string"
}
Responses from the server.
Each response contains the session UUID, the audio cursors, and one specific event.
Field | Description
--- | ---
sessionUuid | Session identifier.
audioCursors | Progress of the stream recognition session: how much data has been received, final and partial times, etc.
responseWallTimeMs | int64 Wall clock on the server side. This is the time when the server wrote the results to the stream.
partial | Partial results; the server sends them regularly after enough audio data has been received from the user. These are the current text estimations. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
final | Final results; the recognition is now fixed up to final_time_ms. For now, a final is sent only if the EOU event was triggered. This could change in future releases. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
eouUpdate | After the EOU classifier triggers, the server sends the message with the final and then an EouUpdate with the time of the EOU. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
finalRefinement | For each final, if normalization is enabled, the server sends the normalized text (or some other advanced post-processing). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
statusCode | Status messages, sent by the server at a fixed interval (keep-alive). Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
classifierUpdate | Result of the triggered classifier. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
speakerAnalysis | Speech statistics for every speaker. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
conversationAnalysis | Conversation statistics. Includes only one of the fields `partial`, `final`, `eouUpdate`, `finalRefinement`, `statusCode`, `classifierUpdate`, `speakerAnalysis`, `conversationAnalysis`.
channelTag | string Tag to distinguish audio channels.
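A sketch of consuming the response stream and dispatching on the event oneof; the oneof name `Event` and the snake_case field names follow the usual protobuf conventions for the schema above and should be treated as assumptions.

```python
def handle_responses(responses):
    # `responses` is the iterator returned by stub.RecognizeStreaming(...).
    for resp in responses:
        event = resp.WhichOneof("Event")  # which event field is set in this response
        if event == "partial" and resp.partial.alternatives:
            print("partial:", resp.partial.alternatives[0].text)
        elif event == "final":
            print("final:", [a.text for a in resp.final.alternatives])
        elif event == "final_refinement":
            refined = resp.final_refinement.normalized_text.alternatives
            print("refinement of final", resp.final_refinement.final_index,
                  ":", refined[0].text if refined else "")
        elif event == "eou_update":
            print("EOU at", resp.eou_update.time_ms, "ms")
        elif event == "status_code":
            print("status:", resp.status_code.message)
```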
SessionUuid
Session identifier.
Field | Description
--- | ---
uuid | string Internal session identifier.
userRequestId | string User session identifier.
AudioCursors
AudioCursors represent the state of the ASR recognition stream.
Field | Description
--- | ---
receivedDataMs | int64 Amount of audio data the server has received, in ms. This cursor is moved after each audio chunk is received by the server.
resetTimeMs | int64 Input stream reset data.
partialTimeMs | int64 How much audio has been processed, including trimmed silences. This cursor is moved after the server has received enough data.
finalTimeMs | int64 Time of the last final. This cursor is moved when the server decides that the recognition from the start of the audio up to final_time_ms will not change anymore.
finalIndex | int64 Index of the last final the server has sent. Incremented after each new final.
eouTimeMs | int64 Estimated time of EOU. This cursor is updated after each new EOU is sent.
AlternativeUpdate
Update of hypothesis.
Field | Description
--- | ---
alternatives[] | List of hypotheses for timeframes.
channelTag | string
Alternative
Recognition of specific time frame.
Field | Description
--- | ---
words[] | Words in the time frame.
text | string Text in the time frame.
startTimeMs | int64 Start of the time frame.
endTimeMs | int64 End of the time frame.
confidence | double The hypothesis confidence. Currently not used.
languages[] | Distribution over possible languages.
Word
Recognized word.
Field | Description
--- | ---
text | string Word text.
startTimeMs | int64 Estimation of word start time in ms.
endTimeMs | int64 Estimation of word end time in ms.
LanguageEstimation
Estimation of language and its probability.
Field | Description
--- | ---
languageCode | string Language code in ISO 639-1 format.
probability | double Estimation of language probability.
EouUpdate
Update information for external End of Utterance.
Field | Description
--- | ---
timeMs | int64 EOU estimated time.
FinalRefinement
Refinement for the final hypothesis. For example, text normalization is a refinement.
Field | Description
--- | ---
finalIndex | int64 Index of the final for which the server sends additional information.
normalizedText | Normalized text instead of the raw one. Includes only one of the fields `normalizedText` (the type of refinement).
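Since a refinement references its final by index, one way to assemble a normalized transcript is to key finals and refinements by `final_index`. This is a sketch under the assumption that the `final_index` cursor in a response carrying a final identifies that final.

```python
def collect_transcript(responses):
    finals, refinements = {}, {}
    for resp in responses:
        event = resp.WhichOneof("Event")
        if event == "final":
            finals[resp.audio_cursors.final_index] = resp.final
        elif event == "final_refinement":
            refinements[resp.final_refinement.final_index] = resp.final_refinement.normalized_text
    # Prefer the normalized text when a refinement has arrived for a final.
    transcript = []
    for index in sorted(finals):
        update = refinements.get(index, finals[index])
        if update.alternatives:
            transcript.append(update.alternatives[0].text)
    return transcript
```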
StatusCode
Status message
Field | Description
--- | ---
codeType | enum CodeType Code type.
message | string Human readable message.
RecognitionClassifierUpdate
Field | Description
--- | ---
windowType | enum WindowType Response window type.
startTimeMs | int64 Start time of the audio segment used for classification.
endTimeMs | int64 End time of the audio segment used for classification.
classifierResult | Result for the dictionary-based classifier.
RecognitionClassifierResult
Field | Description
--- | ---
classifier | string Name of the triggered classifier.
highlights[] | List of highlights, i.e. parts of the phrase that determine the result of the classification.
labels[] | Classifier predictions.
PhraseHighlight
Field | Description
--- | ---
text | string Text transcription of the highlighted audio segment.
startTimeMs | int64 Start time of the highlighted audio segment.
endTimeMs | int64 End time of the highlighted audio segment.
RecognitionClassifierLabel
Field | Description
--- | ---
label | string The label of the class predicted by the classifier.
confidence | double The prediction confidence.
SpeakerAnalysis
Field | Description
--- | ---
speakerTag | string Speaker tag.
windowType | enum WindowType Response window type.
speechBoundaries | Audio segment boundaries.
totalSpeechMs | int64 Total speech duration.
speechRatio | double Speech ratio within the audio segment.
totalSilenceMs | int64 Total silence duration.
silenceRatio | double Silence ratio within the audio segment.
wordsCount | int64 Number of words in recognized speech.
lettersCount | int64 Number of letters in recognized speech.
wordsPerSecond | Descriptive statistics for the words-per-second distribution.
lettersPerSecond | Descriptive statistics for the letters-per-second distribution.
wordsPerUtterance | Descriptive statistics for the words-per-utterance distribution.
lettersPerUtterance | Descriptive statistics for the letters-per-utterance distribution.
utteranceCount | int64 Number of utterances.
utteranceDurationEstimation | Descriptive statistics for the utterance duration distribution.
AudioSegmentBoundaries
Field | Description
--- | ---
startTimeMs | int64 Audio segment start time.
endTimeMs | int64 Audio segment end time.
DescriptiveStatistics
Field | Description
--- | ---
min | double Minimum observed value.
max | double Maximum observed value.
mean | double Estimated mean of the distribution.
std | double Estimated standard deviation of the distribution.
quantiles[] | List of evaluated quantiles.
Quantile
Field | Description
--- | ---
level | double Quantile level in the range (0, 1).
value | double Quantile value.
ConversationAnalysis
Field | Description
--- | ---
conversationBoundaries | Audio segment boundaries.
totalSimultaneousSilenceDurationMs | int64 Total simultaneous silence duration.
totalSimultaneousSilenceRatio | double Simultaneous silence ratio within the audio segment.
simultaneousSilenceDurationEstimation | Descriptive statistics for the simultaneous silence duration distribution.
totalSimultaneousSpeechDurationMs | int64 Total simultaneous speech duration.
totalSimultaneousSpeechRatio | double Simultaneous speech ratio within the audio segment.
simultaneousSpeechDurationEstimation | Descriptive statistics for the simultaneous speech duration distribution.
speakerInterrupts[] | Interrupts description for every speaker.
totalSpeechDurationMs | int64 Total speech duration, including both simultaneous and separate speech.
totalSpeechRatio | double Total speech ratio within the audio segment.
InterruptsEvaluation
Field | Description
--- | ---
speakerTag | string Speaker tag.
interruptsCount | int64 Number of interrupts made by the speaker.
interruptsDurationMs | int64 Total duration of all interrupts.
interrupts[] | Boundaries for every interrupt.