SpeechKit Recognition API v3, gRPC: Recognizer.RecognizeStreaming

Written by Yandex Cloud. Updated November 26, 2024.

In this article:
  • gRPC request
  • StreamingRequest
  • StreamingOptions
  • RecognitionModelOptions
  • AudioFormatOptions
  • RawAudio
  • ContainerAudio
  • TextNormalizationOptions
  • LanguageRestrictionOptions
  • EouClassifierOptions
  • DefaultEouClassifier
  • ExternalEouClassifier
  • RecognitionClassifierOptions
  • RecognitionClassifier
  • SpeechAnalysisOptions
  • SpeakerLabelingOptions
  • AudioChunk
  • SilenceChunk
  • Eou
  • StreamingResponse
  • SessionUuid
  • AudioCursors
  • AlternativeUpdate
  • Alternative
  • Word
  • LanguageEstimation
  • EouUpdate
  • FinalRefinement
  • StatusCode
  • RecognitionClassifierUpdate
  • RecognitionClassifierResult
  • PhraseHighlight
  • RecognitionClassifierLabel
  • SpeakerAnalysis
  • AudioSegmentBoundaries
  • DescriptiveStatistics
  • Quantile
  • ConversationAnalysis
  • InterruptsEvaluation

Expects audio in real time.

gRPC request

rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
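
A minimal connection sketch in Python, assuming the stubs generated from the public protos (shipped in the yandexcloud pip package as yandex.cloud.ai.stt.v3); the call itself, together with response handling, is sketched in the StreamingResponse section below.

import grpc

import yandex.cloud.ai.stt.v3.stt_pb2 as stt_pb2
import yandex.cloud.ai.stt.v3.stt_service_pb2_grpc as stt_service_pb2_grpc

# TLS channel to the public SpeechKit recognition endpoint.
cred = grpc.ssl_channel_credentials()
channel = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)

# RecognizeStreaming is a method of the Recognizer service stub; it takes an
# iterator of StreamingRequest messages and returns an iterator of
# StreamingResponse messages.
stub = stt_service_pb2_grpc.RecognizerStub(channel)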

StreamingRequest

{
  // Includes only one of the fields `session_options`, `chunk`, `silence_chunk`, `eou`
  "session_options": {
    "recognition_model": {
      "model": "string",
      "audio_format": {
        // Includes only one of the fields `raw_audio`, `container_audio`
        "raw_audio": {
          "audio_encoding": "AudioEncoding",
          "sample_rate_hertz": "int64",
          "audio_channel_count": "int64"
        },
        "container_audio": {
          "container_audio_type": "ContainerAudioType"
        }
        // end of the list of possible fields
      },
      "text_normalization": {
        "text_normalization": "TextNormalization",
        "profanity_filter": "bool",
        "literature_text": "bool",
        "phone_formatting_mode": "PhoneFormattingMode"
      },
      "language_restriction": {
        "restriction_type": "LanguageRestrictionType",
        "language_code": [
          "string"
        ]
      },
      "audio_processing_type": "AudioProcessingType"
    },
    "eou_classifier": {
      // Includes only one of the fields `default_classifier`, `external_classifier`
      "default_classifier": {
        "type": "EouSensitivity",
        "max_pause_between_words_hint_ms": "int64"
      },
      "external_classifier": "ExternalEouClassifier"
      // end of the list of possible fields
    },
    "recognition_classifier": {
      "classifiers": [
        {
          "classifier": "string",
          "triggers": [
            "TriggerType"
          ]
        }
      ]
    },
    "speech_analysis": {
      "enable_speaker_analysis": "bool",
      "enable_conversation_analysis": "bool",
      "descriptive_statistics_quantiles": [
        "double"
      ]
    },
    "speaker_labeling": {
      "speaker_labeling": "SpeakerLabeling"
    }
  },
  "chunk": {
    "data": "bytes"
  },
  "silence_chunk": {
    "duration_ms": "int64"
  },
  "eou": "Eou"
  // end of the list of possible fields
}

Streaming audio request.
Events are control messages from the user. The first message must contain the session options; subsequent messages carry audio data chunks or control messages.

Field

Description

session_options

StreamingOptions

Session options. Must be the first message from the user.

Includes only one of the fields session_options, chunk, silence_chunk, eou.

chunk

AudioChunk

Chunk with audio data.

Includes only one of the fields session_options, chunk, silence_chunk, eou.

silence_chunk

SilenceChunk

Chunk with silence.

Includes only one of the fields session_options, chunk, silence_chunk, eou.

eou

Eou

Request to end the current utterance. Works only with the external EOU detector.

Includes only one of the fields session_options, chunk, silence_chunk, eou.
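
A hedged sketch of a request generator that follows this ordering (session options first, then audio chunks), reusing stt_pb2 from the connection sketch above; the chunk size, sample rate, and language are illustrative, and enum nesting follows this reference.

CHUNK_SIZE = 4000  # bytes per audio chunk; illustrative value

def request_iterator(audio_path):
    # First message: session options.
    yield stt_pb2.StreamingRequest(
        session_options=stt_pb2.StreamingOptions(
            recognition_model=stt_pb2.RecognitionModelOptions(
                model='general',
                audio_format=stt_pb2.AudioFormatOptions(
                    raw_audio=stt_pb2.RawAudio(
                        audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                        sample_rate_hertz=8000,
                        audio_channel_count=1,
                    )
                ),
                text_normalization=stt_pb2.TextNormalizationOptions(
                    text_normalization=stt_pb2.TextNormalizationOptions.TEXT_NORMALIZATION_ENABLED,
                ),
                language_restriction=stt_pb2.LanguageRestrictionOptions(
                    restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
                    language_code=['ru-RU'],
                ),
                audio_processing_type=stt_pb2.RecognitionModelOptions.REAL_TIME,
            )
        )
    )
    # Subsequent messages: audio data chunks.
    with open(audio_path, 'rb') as f:
        data = f.read(CHUNK_SIZE)
        while data:
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=data))
            data = f.read(CHUNK_SIZE)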

StreamingOptions

Field

Description

recognition_model

RecognitionModelOptions

Configuration for speech recognition model.

eou_classifier

EouClassifierOptions

Configuration for end of utterance detection model.

recognition_classifier

RecognitionClassifierOptions

Configuration for classifiers over speech recognition.

speech_analysis

SpeechAnalysisOptions

Configuration for speech analysis over speech recognition.

speaker_labeling

SpeakerLabelingOptions

Configuration for speaker labeling

RecognitionModelOptions

Field

Description

model

string

Sets the recognition model for the cloud version of SpeechKit. Possible values: 'general', 'general:rc', 'general:deprecated'.
The model is ignored for SpeechKit Hybrid.

audio_format

AudioFormatOptions

Specifies the input audio format.

text_normalization

TextNormalizationOptions

Text normalization options.

language_restriction

LanguageRestrictionOptions

Possible languages in audio.

audio_processing_type

enum AudioProcessingType

How to process the audio data (in real time, after all data is received, etc.). Default is REAL_TIME.

  • AUDIO_PROCESSING_TYPE_UNSPECIFIED
  • REAL_TIME: Process audio in mode optimized for real-time recognition, i.e. send partials and final responses as soon as possible
  • FULL_DATA: Process audio after all data has been received

AudioFormatOptions

Audio format options.

Field

Description

raw_audio

RawAudio

Audio without container.

Includes only one of the fields raw_audio, container_audio.

container_audio

ContainerAudio

Audio wrapped in a container.

Includes only one of the fields raw_audio, container_audio.

RawAudio

Raw audio format spec (no container to infer the type from). Used in AudioFormat options.

Field

Description

audio_encoding

enum AudioEncoding

Type of audio encoding

  • AUDIO_ENCODING_UNSPECIFIED
  • LINEAR16_PCM: Audio bit depth 16-bit signed little-endian (Linear PCM).

sample_rate_hertz

int64

PCM sample rate

audio_channel_count

int64

PCM channel count. Currently, only single-channel audio is supported in real-time recognition.

ContainerAudio

Audio with fixed type in container. Used in AudioFormat options.

Field

Description

container_audio_type

enum ContainerAudioType

Type of audio container.

  • CONTAINER_AUDIO_TYPE_UNSPECIFIED
  • WAV: Audio bit depth 16-bit signed little-endian (Linear PCM).
  • OGG_OPUS: Data is encoded using the OPUS audio codec and compressed using the OGG container format.
  • MP3: Data is encoded using MPEG-1/2 Layer III and compressed using the MP3 container format.
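
The two oneof branches above map to two ways of constructing AudioFormatOptions; a sketch under the same stt_pb2 assumption as the earlier sketches:

# Raw LPCM: encoding, sample rate, and channel count must be given explicitly.
raw_format = stt_pb2.AudioFormatOptions(
    raw_audio=stt_pb2.RawAudio(
        audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
        sample_rate_hertz=16000,
        audio_channel_count=1,
    )
)

# Container audio: the container type implies the remaining parameters.
opus_format = stt_pb2.AudioFormatOptions(
    container_audio=stt_pb2.ContainerAudio(
        container_audio_type=stt_pb2.ContainerAudio.OGG_OPUS,
    )
)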

TextNormalizationOptions

Text normalization options.

Field

Description

text_normalization

enum TextNormalization

  • TEXT_NORMALIZATION_UNSPECIFIED
  • TEXT_NORMALIZATION_ENABLED: Enable normalization
  • TEXT_NORMALIZATION_DISABLED: Disable normalization

profanity_filter

bool

Profanity filter (default: false).

literature_text

bool

Rewrite text in literature style (default: false).

phone_formatting_mode

enum PhoneFormattingMode

Defines the phone formatting mode.

  • PHONE_FORMATTING_MODE_UNSPECIFIED
  • PHONE_FORMATTING_MODE_DISABLED: Disable phone formatting

LanguageRestrictionOptions

Type of restriction for the list of languages expected in the incoming speech stream.

Field

Description

restriction_type

enum LanguageRestrictionType

Language restriction type

  • LANGUAGE_RESTRICTION_TYPE_UNSPECIFIED
  • WHITELIST: Allow list. The incoming audio may contain only the listed languages.
  • BLACKLIST: Deny list. The incoming audio must not contain the listed languages.

language_code[]

string

The list of language codes that restrict recognition when using the auto model.
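
For instance, to let recognition choose between Russian and English while rejecting everything else, a whitelist might look like this (same stt_pb2 assumption):

lang_options = stt_pb2.LanguageRestrictionOptions(
    restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
    language_code=['ru-RU', 'en-US'],
)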

EouClassifierOptions

Field

Description

default_classifier

DefaultEouClassifier

EOU classifier provided by SpeechKit. Default.

Includes only one of the fields default_classifier, external_classifier.

Type of EOU classifier.

external_classifier

ExternalEouClassifier

EOU is enforced by external messages from the user.

Includes only one of the fields default_classifier, external_classifier.

Type of EOU classifier.

DefaultEouClassifier

Field

Description

type

enum EouSensitivity

EOU sensitivity. Currently there are two levels: a faster one with a higher error rate, and a more conservative one (the default).

  • EOU_SENSITIVITY_UNSPECIFIED
  • DEFAULT
  • HIGH

max_pause_between_words_hint_ms

int64

Hint for the maximum pause between words. The EOU detector can use this information to distinguish the end of an utterance from slow speech (e.g., counting: one, two, three).

ExternalEouClassifier

Use an EOU signal provided by the user.

Field

Description

Empty
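
With the external classifier selected in the session options, the client itself decides where utterances end and signals this with an eou control message. A sketch under the same stt_pb2 assumption:

# In the session options: delegate end-of-utterance detection to the client.
eou_options = stt_pb2.EouClassifierOptions(
    external_classifier=stt_pb2.ExternalEouClassifier()
)

# When the client decides the utterance is over, it sends this control
# message from the request generator instead of an audio chunk.
eou_request = stt_pb2.StreamingRequest(eou=stt_pb2.Eou())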

RecognitionClassifierOptions

Field

Description

classifiers[]

RecognitionClassifier

List of classifiers to use

RecognitionClassifier

Field

Description

classifier

string

Classifier name

triggers[]

enum TriggerType

Defines the types of responses that classification results are attached to.

  • TRIGGER_TYPE_UNSPECIFIED
  • ON_UTTERANCE: Apply classifier to utterance responses
  • ON_FINAL: Apply classifier to final responses
  • ON_PARTIAL: Apply classifier to partial responses
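
A sketch of attaching a classifier to the session, assuming the TriggerType enum is nested in RecognitionClassifier as this reference presents it; the classifier name is hypothetical:

classifier_options = stt_pb2.RecognitionClassifierOptions(
    classifiers=[
        stt_pb2.RecognitionClassifier(
            classifier='formal_greeting',  # hypothetical classifier name
            triggers=[stt_pb2.RecognitionClassifier.ON_FINAL],
        )
    ]
)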

SpeechAnalysisOptions

Field

Description

enable_speaker_analysis

bool

Analyse speech for every speaker

enable_conversation_analysis

bool

Analyse a conversation between two speakers

descriptive_statistics_quantiles[]

double

Quantile levels in range (0, 1) for descriptive statistics

SpeakerLabelingOptions

Field

Description

speaker_labeling

enum SpeakerLabeling

Specifies whether to perform speaker labeling. Default is SPEAKER_LABELING_DISABLED.

  • SPEAKER_LABELING_UNSPECIFIED
  • SPEAKER_LABELING_ENABLED: Enable speaker labeling
  • SPEAKER_LABELING_DISABLED: Disable speaker labeling
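
A sketch combining the analysis and labeling option groups above, under the same stt_pb2 assumption; the quantile levels are illustrative:

analysis_options = stt_pb2.SpeechAnalysisOptions(
    enable_speaker_analysis=True,
    enable_conversation_analysis=True,
    descriptive_statistics_quantiles=[0.5, 0.9, 0.99],  # levels in (0, 1)
)

labeling_options = stt_pb2.SpeakerLabelingOptions(
    speaker_labeling=stt_pb2.SpeakerLabelingOptions.SPEAKER_LABELING_ENABLED,
)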

AudioChunk

Data chunk with audio.

Field

Description

data

bytes

Bytes with audio data.

SilenceChunk

Data chunk with silence.

Field

Description

duration_ms

int64

Duration of silence chunk in ms.

Eou

Force EOU

Field

Description

Empty

StreamingResponse

{
  "session_uuid": {
    "uuid": "string",
    "user_request_id": "string"
  },
  "audio_cursors": {
    "received_data_ms": "int64",
    "reset_time_ms": "int64",
    "partial_time_ms": "int64",
    "final_time_ms": "int64",
    "final_index": "int64",
    "eou_time_ms": "int64"
  },
  "response_wall_time_ms": "int64",
  // Includes only one of the fields `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, `classifier_update`, `speaker_analysis`, `conversation_analysis`
  "partial": {
    "alternatives": [
      {
        "words": [
          {
            "text": "string",
            "start_time_ms": "int64",
            "end_time_ms": "int64"
          }
        ],
        "text": "string",
        "start_time_ms": "int64",
        "end_time_ms": "int64",
        "confidence": "double",
        "languages": [
          {
            "language_code": "string",
            "probability": "double"
          }
        ]
      }
    ],
    "channel_tag": "string"
  },
  "final": {
    "alternatives": [
      {
        "words": [
          {
            "text": "string",
            "start_time_ms": "int64",
            "end_time_ms": "int64"
          }
        ],
        "text": "string",
        "start_time_ms": "int64",
        "end_time_ms": "int64",
        "confidence": "double",
        "languages": [
          {
            "language_code": "string",
            "probability": "double"
          }
        ]
      }
    ],
    "channel_tag": "string"
  },
  "eou_update": {
    "time_ms": "int64"
  },
  "final_refinement": {
    "final_index": "int64",
    // Includes only one of the fields `normalized_text`
    "normalized_text": {
      "alternatives": [
        {
          "words": [
            {
              "text": "string",
              "start_time_ms": "int64",
              "end_time_ms": "int64"
            }
          ],
          "text": "string",
          "start_time_ms": "int64",
          "end_time_ms": "int64",
          "confidence": "double",
          "languages": [
            {
              "language_code": "string",
              "probability": "double"
            }
          ]
        }
      ],
      "channel_tag": "string"
    }
    // end of the list of possible fields
  },
  "status_code": {
    "code_type": "CodeType",
    "message": "string"
  },
  "classifier_update": {
    "window_type": "WindowType",
    "start_time_ms": "int64",
    "end_time_ms": "int64",
    "classifier_result": {
      "classifier": "string",
      "highlights": [
        {
          "text": "string",
          "start_time_ms": "int64",
          "end_time_ms": "int64"
        }
      ],
      "labels": [
        {
          "label": "string",
          "confidence": "double"
        }
      ]
    }
  },
  "speaker_analysis": {
    "speaker_tag": "string",
    "window_type": "WindowType",
    "speech_boundaries": {
      "start_time_ms": "int64",
      "end_time_ms": "int64"
    },
    "total_speech_ms": "int64",
    "speech_ratio": "double",
    "total_silence_ms": "int64",
    "silence_ratio": "double",
    "words_count": "int64",
    "letters_count": "int64",
    "words_per_second": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    },
    "letters_per_second": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    },
    "words_per_utterance": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    },
    "letters_per_utterance": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    },
    "utterance_count": "int64",
    "utterance_duration_estimation": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    }
  },
  "conversation_analysis": {
    "conversation_boundaries": {
      "start_time_ms": "int64",
      "end_time_ms": "int64"
    },
    "total_simultaneous_silence_duration_ms": "int64",
    "total_simultaneous_silence_ratio": "double",
    "simultaneous_silence_duration_estimation": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    },
    "total_simultaneous_speech_duration_ms": "int64",
    "total_simultaneous_speech_ratio": "double",
    "simultaneous_speech_duration_estimation": {
      "min": "double",
      "max": "double",
      "mean": "double",
      "std": "double",
      "quantiles": [
        {
          "level": "double",
          "value": "double"
        }
      ]
    },
    "speaker_interrupts": [
      {
        "speaker_tag": "string",
        "interrupts_count": "int64",
        "interrupts_duration_ms": "int64",
        "interrupts": [
          {
            "start_time_ms": "int64",
            "end_time_ms": "int64"
          }
        ]
      }
    ],
    "total_speech_duration_ms": "int64",
    "total_speech_ratio": "double"
  },
  // end of the list of possible fields
  "channel_tag": "string"
}

Responses from the server.
Each response contains the session UUID and AudioCursors, plus one specific event; a typical consumption loop is sketched after the field list below.

Field

Description

session_uuid

SessionUuid

Session identifier

audio_cursors

AudioCursors

Progress indicators for the streaming recognition session: how much data has been received, final and partial times, etc.

response_wall_time_ms

int64

Server-side wall-clock time at which the server wrote the results to the stream.

partial

AlternativeUpdate

Partial results. The server sends them regularly once enough audio data has been received from the user. This is the current text estimate
from final_time_ms to partial_time_ms; it may change as new data arrives.

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

final

AlternativeUpdate

Final results; the recognition is now fixed up to final_time_ms. For now, a final is sent only if the EOU event was triggered. This may change in future releases.

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

eou_update

EouUpdate

After the EOU classifier triggers, the server sends a final message and then an EouUpdate with the time of the EOU.
Before each eou_update, a final with the same time is sent; there may be several finals before an eou_update.

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

final_refinement

FinalRefinement

For each final, if normalization is enabled, the server sends the normalized text (or the result of other advanced post-processing).
Final normalization introduces additional latency.

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

status_code

StatusCode

Status messages sent by the server at a fixed interval (keep-alive).

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

classifier_update

RecognitionClassifierUpdate

Result of the triggered classifier

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

speaker_analysis

SpeakerAnalysis

Speech statistics for every speaker

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

conversation_analysis

ConversationAnalysis

Conversation statistics

Includes only one of the fields partial, final, eou_update, final_refinement, status_code, classifier_update, speaker_analysis, conversation_analysis.

channel_tag

string

Tag used to distinguish audio channels.
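
A sketch of the consumption loop referenced above, reusing stub and request_iterator from the earlier sketches. The oneof group is named Event in the public protos, so the generated Python class exposes it via WhichOneof; the API key and file name are illustrative.

api_key = '<your-api-key>'  # illustrative; API-key authentication assumed

responses = stub.RecognizeStreaming(
    request_iterator('audio.pcm'),  # generator from the StreamingRequest sketch
    metadata=(('authorization', f'Api-Key {api_key}'),),
)

for response in responses:
    event = response.WhichOneof('Event')
    if event == 'partial' and response.partial.alternatives:
        print('partial:', response.partial.alternatives[0].text)
    elif event == 'final' and response.final.alternatives:
        print('final:', response.final.alternatives[0].text)
    elif event == 'final_refinement':
        alts = response.final_refinement.normalized_text.alternatives
        print('refined:', alts[0].text if alts else '')
    elif event == 'eou_update':
        print('EOU at', response.eou_update.time_ms, 'ms')
    elif event == 'status_code':
        print('status:', response.status_code.message)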

SessionUuid

Session identifier.

Field

Description

uuid

string

Internal session identifier.

user_request_id

string

User session identifier.

AudioCursors

AudioCursors represent the state of the ASR recognition stream.

Field

Description

received_data_ms

int64

Amount of audio data the server has received, in ms. This cursor advances after each audio chunk the server receives.

reset_time_ms

int64

Input stream reset data.

partial_time_ms

int64

How much audio has been processed, including trimmed silence. This cursor advances once the server has received enough data
to update the recognition results (silence included).

final_time_ms

int64

Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to final_time_ms will no longer change;
usually this event is followed by EOU detection (but this may change in the future).

final_index

int64

Index of the last final the server sent. Incremented after each new final.

eou_time_ms

int64

Estimated time of EOU. This cursor updates after each new EOU is sent.
For the external classifier, it equals received_data_ms at the moment the EOU event arrives.
For the internal classifier, it is a time estimate; the time is not exact and has the same guarantees as word timings.

AlternativeUpdate

Update of hypothesis.

Field

Description

alternatives[]

Alternative

List of hypotheses for the time frame.

channel_tag

string

Alternative

Recognition of specific time frame.

Field

Description

words[]

Word

Words in time frame.

text

string

Text in time frame.

start_time_ms

int64

Start of time frame.

end_time_ms

int64

End of time frame.

confidence

double

The hypothesis confidence. Currently not used.

languages[]

LanguageEstimation

Distribution over possible languages.

Word

Recognized word.

Field

Description

text

string

Word text.

start_time_ms

int64

Estimation of word start time in ms.

end_time_ms

int64

Estimation of word end time in ms.

LanguageEstimation

Estimation of language and its probability.

Field

Description

language_code

string

Language code in ISO 639-1 format.

probability

double

Estimation of language probability.

EouUpdate

Update information for external End of Utterance.

Field

Description

time_ms

int64

EOU estimated time.

FinalRefinement

Refinement for a final hypothesis. For example, text normalization is a refinement.

Field

Description

final_index

int64

Index of final for which server sends additional information.

normalized_text

AlternativeUpdate

Normalized text instead of the raw one.

Includes only one of the fields normalized_text.

Type of refinement.

StatusCode

Status message

Field

Description

code_type

enum CodeType

Code type.

  • CODE_TYPE_UNSPECIFIED
  • WORKING: All good.
  • WARNING: For example, if speech is not sent in real time, or the context is unknown and a fallback was used.
  • CLOSED: After session was closed.

message

string

Human readable message.

RecognitionClassifierUpdate

Field

Description

window_type

enum WindowType

Response window type

  • WINDOW_TYPE_UNSPECIFIED
  • LAST_UTTERANCE: The result of applying the classifier to the last utterance response
  • LAST_FINAL: The result of applying the classifier to the last final response
  • LAST_PARTIAL: The result of applying the classifier to the last partial response

start_time_ms

int64

Start time of the audio segment used for classification

end_time_ms

int64

End time of the audio segment used for classification

classifier_result

RecognitionClassifierResult

Result for dictionary-based classifier

RecognitionClassifierResult

Field

Description

classifier

string

Name of the triggered classifier

highlights[]

PhraseHighlight

List of highlights, i.e., the parts of the phrase that determined the classification result

labels[]

RecognitionClassifierLabel

Classifier predictions

PhraseHighlight

Field

Description

text

string

Text transcription of the highlighted audio segment

start_time_ms

int64

Start time of the highlighted audio segment

end_time_ms

int64

End time of the highlighted audio segment

RecognitionClassifierLabel

Field

Description

label

string

The label of the class predicted by the classifier

confidence

double

The prediction confidence

SpeakerAnalysis

Field

Description

speaker_tag

string

Speaker tag

window_type

enum WindowType

Response window type

  • WINDOW_TYPE_UNSPECIFIED
  • TOTAL: Stats for all received audio.
  • LAST_UTTERANCE: Stats for last utterance.

speech_boundaries

AudioSegmentBoundaries

Audio segment boundaries

total_speech_ms

int64

Total speech duration

speech_ratio

double

Speech ratio within audio segment

total_silence_ms

int64

Total silence duration

silence_ratio

double

Silence ratio within audio segment

words_count

int64

Number of words in recognized speech

letters_count

int64

Number of letters in recognized speech

words_per_second

DescriptiveStatistics

Descriptive statistics for words per second distribution

letters_per_second

DescriptiveStatistics

Descriptive statistics for letters per second distribution

words_per_utterance

DescriptiveStatistics

Descriptive statistics for words per utterance distribution

letters_per_utterance

DescriptiveStatistics

Descriptive statistics for letters per utterance distribution

utterance_count

int64

Number of utterances

utterance_duration_estimation

DescriptiveStatistics

Descriptive statistics for utterance duration distribution

AudioSegmentBoundaries

Field

Description

start_time_ms

int64

Audio segment start time

end_time_ms

int64

Audio segment end time

DescriptiveStatistics

Field

Description

min

double

Minimum observed value

max

double

Maximum observed value

mean

double

Estimated mean of distribution

std

double

Estimated standard deviation of distribution

quantiles[]

Quantile

List of evaluated quantiles

Quantile

Field

Description

level

double

Quantile level in range (0, 1)

value

double

Quantile value

ConversationAnalysis

Field

Description

conversation_boundaries

AudioSegmentBoundaries

Audio segment boundaries

total_simultaneous_silence_duration_ms

int64

Total simultaneous silence duration

total_simultaneous_silence_ratio

double

Simultaneous silence ratio within audio segment

simultaneous_silence_duration_estimation

DescriptiveStatistics

Descriptive statistics for simultaneous silence duration distribution

total_simultaneous_speech_duration_ms

int64

Total simultaneous speech duration

total_simultaneous_speech_ratio

double

Simultaneous speech ratio within audio segment

simultaneous_speech_duration_estimation

DescriptiveStatistics

Descriptive statistics for simultaneous speech duration distribution

speaker_interrupts[]

InterruptsEvaluation

Interrupts description for every speaker

total_speech_duration_ms

int64

Total speech duration, including both simultaneous and separate speech

total_speech_ratio

double

Total speech ratio within audio segment

InterruptsEvaluation

Field

Description

speaker_tag

string

Speaker tag

interrupts_count

int64

Number of interrupts made by the speaker

interrupts_duration_ms

int64

Total duration of all interrupts

interrupts[]

AudioSegmentBoundaries

Boundaries for every interrupt
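
To illustrate how the analysis messages might be read, a hedged sketch of printing per-speaker and conversation statistics, suitable for use inside the response loop from the StreamingResponse section:

def print_analysis(response):
    event = response.WhichOneof('Event')
    if event == 'speaker_analysis':
        sa = response.speaker_analysis
        print(f'speaker {sa.speaker_tag}: {sa.total_speech_ms} ms speech '
              f'({sa.speech_ratio:.0%}), {sa.utterance_count} utterances')
        for q in sa.words_per_second.quantiles:
            # Each quantile pairs a level in (0, 1) with its value.
            print(f'  words/s at level {q.level:.2f}: {q.value:.2f}')
    elif event == 'conversation_analysis':
        ca = response.conversation_analysis
        print(f'conversation: {ca.total_speech_duration_ms} ms total speech, '
              f'{ca.total_simultaneous_speech_ratio:.0%} simultaneous')
        for s in ca.speaker_interrupts:
            print(f'  speaker {s.speaker_tag}: {s.interrupts_count} interrupts')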
