© 2025 Direct Cursus Technology L.L.C.

In this article:

  • gRPC request
  • StreamSynthesisRequest
  • SynthesisOptions
  • AudioFormatOptions
  • RawAudio
  • ContainerAudio
  • SynthesisInput
  • ForceSynthesisEvent
  • StreamSynthesisResponse
  • AudioChunk
  • TextChunk

SpeechKit Synthesis Service API v3, gRPC: Synthesizer.StreamSynthesis

Written by Yandex Cloud. Updated on August 8, 2025.

Bidirectional streaming RPC for real-time synthesis.

gRPC request

rpc StreamSynthesis (stream StreamSynthesisRequest) returns (stream StreamSynthesisResponse)

StreamSynthesisRequest

{
  // Includes only one of the fields `options`, `synthesis_input`, `force_synthesis`
  "options": {
    "model": "string",
    "voice": "string",
    "role": "string",
    "speed": "double",
    "volume": "double",
    "pitch_shift": "double",
    "output_audio_spec": {
      // Includes only one of the fields `raw_audio`, `container_audio`
      "raw_audio": {
        "audio_encoding": "AudioEncoding",
        "sample_rate_hertz": "int64"
      },
      "container_audio": {
        "container_audio_type": "ContainerAudioType"
      }
      // end of the list of possible fields
    },
    "loudness_normalization_type": "LoudnessNormalizationType"
  },
  "synthesis_input": {
    "text": "string"
  },
  "force_synthesis": "ForceSynthesisEvent"
  // end of the list of possible fields
}

Sent by the client to control streaming synthesis or to provide data for it.

Field (type): description

  • options (SynthesisOptions): Synthesis options. Must be provided in the first request of the stream and cannot be updated afterwards.
  • synthesis_input (SynthesisInput): Input to be synthesized.
  • force_synthesis (ForceSynthesisEvent): Triggers immediate synthesis of buffered input.

Each request includes only one of the fields options, synthesis_input, force_synthesis.
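The ordering rules above can be sketched without any generated gRPC code. The following is a minimal illustration that models each request as a plain Python dict; the helper names `build_requests` and `check_stream`, the voice name `example_voice`, and the choice to send `force_synthesis` as the last message are assumptions made for the example, not part of the SpeechKit API.

```python
from typing import Any, Dict, Iterable, Iterator, List

# The three mutually exclusive fields of StreamSynthesisRequest.
ONEOF_FIELDS = ("options", "synthesis_input", "force_synthesis")


def build_requests(voice: str, texts: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield dict stand-ins for request messages in a valid order (hypothetical helper)."""
    # First message: options only.
    yield {"options": {"voice": voice}}
    # Following messages: one text fragment each.
    for text in texts:
        yield {"synthesis_input": {"text": text}}
    # Illustrative choice: ask for synthesis of whatever is still buffered.
    yield {"force_synthesis": {}}


def check_stream(requests: Iterable[Dict[str, Any]]) -> List[str]:
    """Return a list of rule violations; an empty list means the sequence is valid."""
    problems = []
    for index, request in enumerate(requests):
        present = [name for name in ONEOF_FIELDS if name in request]
        if len(present) != 1:
            problems.append(f"message {index}: expected exactly one field, got {present}")
            continue
        if index == 0 and present[0] != "options":
            problems.append("message 0: options must be sent first")
        if index > 0 and present[0] == "options":
            problems.append(f"message {index}: options cannot be updated after the first message")
    return problems


messages = list(build_requests("example_voice", ["Hello,", " world."]))
print(check_stream(messages))
```

With real generated stubs, a generator like `build_requests` would yield protobuf messages instead of dicts and be passed to the streaming call as the request iterator.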

SynthesisOptions

Field (type): description

  • model (string): The name of the TTS model to use for synthesis. Currently, this field must be left empty; do not set it.
  • voice (string): The voice to use for speech synthesis.
  • role (string): The role or speaking style. Can be used to set the speaker's manner of pronunciation.
  • speed (double): Speed multiplier (default: 1.0).
  • volume (double): Volume adjustment. The valid range depends on the loudness normalization type:
    • For MAX_PEAK: range is (0, 1], default 0.7.
    • For LUFS: range is [-145, 0), default -19.
  • pitch_shift (double): Pitch adjustment, in Hz, range [-1000, 1000], default 0.
  • output_audio_spec (AudioFormatOptions): Specifies the output audio format. Default: 22050 Hz, linear 16-bit signed little-endian PCM, with WAV header.
  • loudness_normalization_type (enum LoudnessNormalizationType): Loudness normalization type for output (default: LUFS).
    • LOUDNESS_NORMALIZATION_TYPE_UNSPECIFIED: Unspecified loudness normalization. The default behavior will be used.
    • MAX_PEAK: Normalization in which the gain is changed to bring the highest PCM sample value or analog signal peak to a given level.
    • LUFS: Normalization based on the EBU R 128 recommendation.
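Because the valid `volume` range depends on the normalization type, a client may want to check values before opening a stream. The sketch below encodes the ranges and defaults listed above; the function name `check_options` and the `VOLUME_RULES` table are hypothetical, and the code does not handle LOUDNESS_NORMALIZATION_TYPE_UNSPECIFIED.

```python
from typing import List, Optional

# normalization type -> (low, high, low_inclusive, high_inclusive, default)
VOLUME_RULES = {
    "MAX_PEAK": (0.0, 1.0, False, True, 0.7),   # (0, 1]
    "LUFS": (-145.0, 0.0, True, False, -19.0),  # [-145, 0)
}


def check_options(
    volume: Optional[float] = None,
    pitch_shift: float = 0.0,
    normalization: str = "LUFS",
) -> List[str]:
    """Return a list of range violations (hypothetical client-side helper)."""
    errors = []
    low, high, low_inclusive, high_inclusive, default = VOLUME_RULES[normalization]
    # An unset volume falls back to the documented default for the type.
    value = default if volume is None else volume
    above_low = value >= low if low_inclusive else value > low
    below_high = value <= high if high_inclusive else value < high
    if not (above_low and below_high):
        errors.append(f"volume {value} is out of range for {normalization}")
    if not -1000 <= pitch_shift <= 1000:
        errors.append(f"pitch_shift {pitch_shift} is outside [-1000, 1000]")
    return errors


print(check_options(volume=0.7, normalization="MAX_PEAK"))
print(check_options(volume=0.0, normalization="LUFS"))
```

Note that a value such as 0.7 is valid for MAX_PEAK but not for LUFS, so switching the normalization type without adjusting `volume` is an easy mistake to make.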

AudioFormatOptions

Field (type): description

  • raw_audio (RawAudio): The audio format specified in request parameters.
  • container_audio (ContainerAudio): The audio format specified inside the container metadata.

Includes only one of the fields raw_audio, container_audio.
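As a rough illustration of the "only one of" rule, the helper below builds a dict shaped like `output_audio_spec` and refuses to set both variants. The function name `make_audio_spec` and its parameters are hypothetical; the field and enum names are taken from the schema in this article.

```python
from typing import Any, Dict, Optional

CONTAINER_TYPES = ("WAV", "OGG_OPUS", "MP3")


def make_audio_spec(
    container_type: Optional[str] = None,
    sample_rate_hertz: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a dict stand-in for AudioFormatOptions with exactly one variant set."""
    # Exactly one of the two arguments must be given.
    if (container_type is None) == (sample_rate_hertz is None):
        raise ValueError("set exactly one of container_type or sample_rate_hertz")
    if container_type is not None:
        if container_type not in CONTAINER_TYPES:
            raise ValueError(f"unknown container type: {container_type}")
        return {"container_audio": {"container_audio_type": container_type}}
    # Raw audio: LINEAR16_PCM is the only concrete encoding listed.
    return {
        "raw_audio": {
            "audio_encoding": "LINEAR16_PCM",
            "sample_rate_hertz": sample_rate_hertz,
        }
    }


print(make_audio_spec(container_type="OGG_OPUS"))
print(make_audio_spec(sample_rate_hertz=22050))
```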

RawAudio

Field (type): description

  • audio_encoding (enum AudioEncoding): Encoding type.
    • AUDIO_ENCODING_UNSPECIFIED
    • LINEAR16_PCM: 16-bit signed little-endian linear PCM.
  • sample_rate_hertz (int64): Sampling frequency of the signal, in Hz.
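Raw LINEAR16_PCM output has no header, so a player needs the sample rate supplied separately. The sketch below uses Python's standard `wave` module to estimate the duration of a raw buffer and to wrap it in a WAV header. It assumes mono audio, which this article does not state; the helper names are hypothetical.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # LINEAR16_PCM: 16 bits per sample


def pcm_duration_ms(pcm: bytes, sample_rate_hertz: int, channels: int = 1) -> float:
    """Duration of a raw PCM buffer in milliseconds (mono by assumption)."""
    frames = len(pcm) // (BYTES_PER_SAMPLE * channels)
    return 1000.0 * frames / sample_rate_hertz


def pcm_to_wav(pcm: bytes, sample_rate_hertz: int, channels: int = 1) -> bytes:
    """Wrap raw PCM bytes in a WAV header so ordinary players can open them."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(BYTES_PER_SAMPLE)
        wav_file.setframerate(sample_rate_hertz)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at the default rate of 22050 Hz.
silence = b"\x00\x00" * 22050
print(pcm_duration_ms(silence, 22050))
print(pcm_to_wav(silence, 22050)[:4])
```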

ContainerAudio

Field (type): description

  • container_audio_type (enum ContainerAudioType), one of:
    • CONTAINER_AUDIO_TYPE_UNSPECIFIED
    • WAV: 16-bit signed little-endian linear PCM in a WAV container.
    • OGG_OPUS: Data is encoded using the Opus audio codec and packaged in the OGG container format.
    • MP3: Data is encoded using MPEG-1/2 Layer III and stored in the MP3 format.

SynthesisInput

The input for synthesis.

Field (type): description

  • text (string): The text string to be synthesized.

ForceSynthesisEvent

Event to forcibly trigger synthesis.

This message has no fields.

StreamSynthesisResponse

{
  "audio_chunk": {
    "data": "bytes"
  },
  "text_chunk": {
    "text": "string"
  },
  "start_ms": "int64",
  "length_ms": "int64"
}

Field (type): description

  • audio_chunk (AudioChunk): Part of synthesized audio.
  • text_chunk (TextChunk): Part of synthesized text.
  • start_ms (int64): Start time of the audio chunk in milliseconds.
  • length_ms (int64): Length of the audio chunk in milliseconds.
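On the receiving side, a client typically accumulates the audio bytes as responses arrive. The following sketch models each response as a plain dict and makes two assumptions that this article does not confirm: that chunks can be concatenated in arrival order, and that `start_ms` is measured from the beginning of the stream's audio. The name `collect_audio` is hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple


def collect_audio(responses: Iterable[Dict[str, Any]]) -> Tuple[bytes, int]:
    """Concatenate audio chunks and return (audio_bytes, end_of_audio_ms)."""
    audio = bytearray()
    end_ms = 0
    for response in responses:
        chunk = response.get("audio_chunk")
        if chunk:
            audio.extend(chunk["data"])
        # Track the furthest point in time covered by any chunk so far.
        chunk_end = response.get("start_ms", 0) + response.get("length_ms", 0)
        end_ms = max(end_ms, chunk_end)
    return bytes(audio), end_ms


# Dict stand-ins for two StreamSynthesisResponse messages.
sample = [
    {"audio_chunk": {"data": b"ab"}, "text_chunk": {"text": "Hello,"}, "start_ms": 0, "length_ms": 100},
    {"audio_chunk": {"data": b"cd"}, "text_chunk": {"text": " world."}, "start_ms": 100, "length_ms": 50},
]
print(collect_audio(sample))
```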

AudioChunk

Field (type): description

  • data (bytes): Sequence of bytes of the synthesized audio, in the format specified by output_audio_spec.

TextChunk

Field (type): description

  • text (string): Synthesized text.
