SpeechKit Synthesis Service API v3, gRPC: Synthesizer.StreamSynthesis

Written by

Yandex Cloud

Updated at August 8, 2025

gRPC request
StreamSynthesisRequest
SynthesisOptions
AudioFormatOptions
RawAudio
ContainerAudio
SynthesisInput
ForceSynthesisEvent
StreamSynthesisResponse
AudioChunk
TextChunk

Bidirectional streaming RPC for real-time synthesis.

gRPC request

rpc StreamSynthesis (stream StreamSynthesisRequest) returns (stream StreamSynthesisResponse)

StreamSynthesisRequest

{
  // Includes only one of the fields `options`, `synthesis_input`, `force_synthesis`
  "options": {
    "model": "string",
    "voice": "string",
    "role": "string",
    "speed": "double",
    "volume": "double",
    "pitch_shift": "double",
    "output_audio_spec": {
      // Includes only one of the fields `raw_audio`, `container_audio`
      "raw_audio": {
        "audio_encoding": "AudioEncoding",
        "sample_rate_hertz": "int64"
      },
      "container_audio": {
        "container_audio_type": "ContainerAudioType"
      }
      // end of the list of possible fields
    },
    "loudness_normalization_type": "LoudnessNormalizationType"
  },
  "synthesis_input": {
    "text": "string"
  },
  "force_synthesis": "ForceSynthesisEvent"
  // end of the list of possible fields
}

Sent by client to control or provide data during streaming synthesis.

Field	Description
options	SynthesisOptions Synthesis options. Must be provided in the first request of the stream and cannot be updated afterwards. Includes only one of the fields `options`, `synthesis_input`, `force_synthesis`.
synthesis_input	SynthesisInput Input to be synthesized. Includes only one of the fields `options`, `synthesis_input`, `force_synthesis`.
force_synthesis	ForceSynthesisEvent Triggers immediate synthesis of buffered input. Includes only one of the fields `options`, `synthesis_input`, `force_synthesis`.
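
The message ordering described above can be sketched with plain dictionaries that mirror the JSON-like schema. This is only an illustration under stated assumptions: a real client would use the protobuf classes generated from the service definition, which are not shown on this page. The helper names `build_requests` and `check_oneof` and the voice name `example-voice` are hypothetical, and sending `force_synthesis` as the last message is one possible usage, not a documented requirement.

```python
from typing import Any, Dict, Iterable, Iterator

# Field names taken from the StreamSynthesisRequest schema.
ONEOF_FIELDS = ("options", "synthesis_input", "force_synthesis")


def build_requests(options: Dict[str, Any], texts: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield request payloads in the order the stream expects (hypothetical helper)."""
    # The first message carries only the synthesis options.
    yield {"options": options}
    # Each later message carries one piece of input text.
    for text in texts:
        yield {"synthesis_input": {"text": text}}
    # ForceSynthesisEvent has no fields; it triggers synthesis of buffered input.
    yield {"force_synthesis": {}}


def check_oneof(request: Dict[str, Any]) -> str:
    """Return the single populated field name, or raise ValueError."""
    present = [name for name in ONEOF_FIELDS if name in request]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {ONEOF_FIELDS}, got {present}")
    return present[0]


if __name__ == "__main__":
    for request in build_requests({"voice": "example-voice"}, ["Hello, ", "world."]):
        print(check_oneof(request), request)
```

With a generated gRPC stub, an iterator like this would typically be passed as the request stream of the bidirectional call.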

SynthesisOptions

Field	Description
model	string The name of the TTS model to use for synthesis. Currently not in use: leave this field empty.
voice	string The voice to use for speech synthesis.
role	string The role or speaking style. Can be used to specify pronunciation character for the speaker.
speed	double Speed multiplier (default: 1.0).
volume	double Volume adjustment: For `MAX_PEAK`: range is (0, 1], default 0.7. For `LUFS`: range is [-145, 0), default -19.
pitch_shift	double Pitch adjustment, in Hz, range [-1000, 1000], default 0.
output_audio_spec	AudioFormatOptions Specifies output audio format. Default: 22050Hz, linear 16-bit signed little-endian PCM, with WAV header.
loudness_normalization_type	enum LoudnessNormalizationType Loudness normalization type for output (default: `LUFS`). `LOUDNESS_NORMALIZATION_TYPE_UNSPECIFIED`: Unspecified loudness normalization. The default behavior will be used. `MAX_PEAK`: The type of normalization, wherein the gain is changed to bring the highest PCM sample value or analog signal peak to a given level. `LUFS`: The type of normalization based on EBU R 128 recommendation.
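
The ranges and defaults in the table above can be checked on the client before opening a stream. The sketch below is a minimal, non-authoritative example: the function names are hypothetical, and treating `LOUDNESS_NORMALIZATION_TYPE_UNSPECIFIED` as `LUFS` is an interpretation of "the default behavior will be used" combined with the stated default.

```python
PITCH_SHIFT_RANGE_HZ = (-1000.0, 1000.0)  # inclusive range from the table


def effective_normalization(normalization: str) -> str:
    """Map an unspecified value to the documented default (assumed to be LUFS)."""
    if normalization in ("", "LOUDNESS_NORMALIZATION_TYPE_UNSPECIFIED"):
        return "LUFS"
    return normalization


def default_volume(normalization: str = "LUFS") -> float:
    """Return the documented default volume for a normalization type."""
    return {"MAX_PEAK": 0.7, "LUFS": -19.0}[effective_normalization(normalization)]


def volume_in_range(volume: float, normalization: str = "LUFS") -> bool:
    """Check volume against (0, 1] for MAX_PEAK or [-145, 0) for LUFS."""
    kind = effective_normalization(normalization)
    if kind == "MAX_PEAK":
        return 0.0 < volume <= 1.0
    if kind == "LUFS":
        return -145.0 <= volume < 0.0
    raise ValueError(f"unknown normalization type: {normalization!r}")


def pitch_shift_in_range(pitch_shift: float) -> bool:
    """Check pitch_shift against the inclusive range [-1000, 1000] Hz."""
    low, high = PITCH_SHIFT_RANGE_HZ
    return low <= pitch_shift <= high
```

Note that the two volume ranges do not overlap, so a value valid for `MAX_PEAK` is always invalid for `LUFS` and vice versa.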

AudioFormatOptions

Field	Description
raw_audio	RawAudio The audio format specified in request parameters. Includes only one of the fields `raw_audio`, `container_audio`.
container_audio	ContainerAudio The audio format specified inside the container metadata. Includes only one of the fields `raw_audio`, `container_audio`.

RawAudio

Field	Description
audio_encoding	enum AudioEncoding Encoding type. `AUDIO_ENCODING_UNSPECIFIED`. `LINEAR16_PCM`: Audio bit depth 16-bit signed little-endian (Linear PCM).
sample_rate_hertz	int64 Sampling frequency of the signal.
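
Because `raw_audio` output carries no header, the client has to remember the encoding and sample rate it requested in order to play or store the result. The sketch below wraps `LINEAR16_PCM` bytes in a WAV header using Python's standard `wave` module. It assumes mono audio, which this page does not state, and the function names are hypothetical.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # LINEAR16_PCM is 16-bit signed little-endian


def pcm_to_wav(pcm: bytes, sample_rate_hertz: int, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container (channels=1 is an assumption)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(BYTES_PER_SAMPLE)
        wav_file.setframerate(sample_rate_hertz)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_ms(pcm: bytes, sample_rate_hertz: int, channels: int = 1) -> int:
    """Duration of raw 16-bit PCM in whole milliseconds."""
    frames = len(pcm) // (BYTES_PER_SAMPLE * channels)
    return frames * 1000 // sample_rate_hertz
```

For example, one second of silence at 22050 Hz is `b"\x00\x00" * 22050`, which `pcm_to_wav` turns into a file that standard audio tools can open.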

ContainerAudio

Field	Description
container_audio_type	enum ContainerAudioType `CONTAINER_AUDIO_TYPE_UNSPECIFIED`. `WAV`: Audio bit depth 16-bit signed little-endian (Linear PCM). `OGG_OPUS`: Data is encoded using the OPUS audio codec and compressed using the OGG container format. `MP3`: Data is encoded using MPEG-1/2 Layer III and compressed using the MP3 container format.
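
Since `output_audio_spec` accepts exactly one of `raw_audio` or `container_audio`, a small builder can enforce that choice before the options are sent. This is a sketch using plain dictionaries shaped like the schema on this page; the function name `make_output_audio_spec` is hypothetical, and a real client would populate the generated protobuf messages instead.

```python
from typing import Any, Dict, Optional

# Container types listed in the ContainerAudioType enum (excluding UNSPECIFIED).
CONTAINER_TYPES = ("WAV", "OGG_OPUS", "MP3")


def make_output_audio_spec(
    *,
    sample_rate_hertz: Optional[int] = None,
    container_audio_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an output_audio_spec dict with exactly one branch populated."""
    if (sample_rate_hertz is None) == (container_audio_type is None):
        raise ValueError("set exactly one of sample_rate_hertz or container_audio_type")
    if sample_rate_hertz is not None:
        # Raw output: LINEAR16_PCM is the only concrete encoding listed.
        return {
            "raw_audio": {
                "audio_encoding": "LINEAR16_PCM",
                "sample_rate_hertz": sample_rate_hertz,
            }
        }
    if container_audio_type not in CONTAINER_TYPES:
        raise ValueError(f"unknown container type: {container_audio_type!r}")
    return {"container_audio": {"container_audio_type": container_audio_type}}
```

Calling `make_output_audio_spec(container_audio_type="OGG_OPUS")` yields a container spec, while `make_output_audio_spec(sample_rate_hertz=22050)` yields a raw PCM spec.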

SynthesisInput

The input for synthesis.

Field	Description
text	string The text string to be synthesized.

ForceSynthesisEvent

Event to forcibly trigger synthesis.

This message has no fields.

StreamSynthesisResponse

{
  "audio_chunk": {
    "data": "bytes"
  },
  "text_chunk": {
    "text": "string"
  },
  "start_ms": "int64",
  "length_ms": "int64"
}

Field	Description
audio_chunk	AudioChunk Part of synthesized audio.
text_chunk	TextChunk Part of synthesized text.
start_ms	int64 Start time of the audio chunk in milliseconds.
length_ms	int64 Length of the audio chunk in milliseconds.
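
On the client side, the response stream is usually reduced to a single audio buffer. The sketch below does this for responses represented as plain dictionaries shaped like the schema above. It assumes that audio bytes can simply be appended in arrival order and that `start_ms` is measured from the beginning of the synthesized audio; neither is stated on this page, and the function name `collect_responses` is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_responses(responses: Iterable[Dict[str, Any]]) -> Tuple[bytes, List[str], int]:
    """Return (audio bytes, text pieces, end time in ms) for a response stream."""
    audio = bytearray()
    texts: List[str] = []
    end_ms = 0
    for response in responses:
        audio_chunk = response.get("audio_chunk")
        if audio_chunk:
            # Append audio in arrival order (assumption, see lead-in).
            audio.extend(audio_chunk["data"])
        text_chunk = response.get("text_chunk")
        if text_chunk:
            texts.append(text_chunk["text"])
        # Track the latest point in time covered by any chunk.
        chunk_end = response.get("start_ms", 0) + response.get("length_ms", 0)
        end_ms = max(end_ms, chunk_end)
    return bytes(audio), texts, end_ms
```

The text pieces are returned as a list rather than joined, because the page does not say how chunk boundaries relate to whitespace.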

AudioChunk

Field	Description
data	bytes Sequence of bytes of the synthesized audio in the format specified in `output_audio_spec`.

TextChunk

Field	Description
text	string Synthesized text.
