© 2025 Direct Cursus Technology L.L.C.


SpeechKit Hybrid Synthesis Service API, gRPC: Synthesizer

Written by Yandex Cloud
Updated on January 16, 2024
  • Calls Synthesizer
  • UtteranceSynthesis
    • UtteranceSynthesisRequest
    • TextTemplate
    • TextVariable
    • Hints
    • AudioTemplate
    • AudioContent
    • AudioVariable
    • DurationHint
    • AudioFormatOptions
    • RawAudio
    • ContainerAudio
    • UtteranceSynthesisResponse
    • AudioChunk
    • TextChunk

A set of methods for voice synthesis.

Call Description
UtteranceSynthesis Synthesizing text into speech.

Calls Synthesizer

UtteranceSynthesis

Synthesizing text into speech.

rpc UtteranceSynthesis (UtteranceSynthesisRequest) returns (stream UtteranceSynthesisResponse)

UtteranceSynthesisRequest

Field Description
model string
The name of the model, which specifies basic synthesis functionality. Currently this field must be left empty; do not set it.
Utterance oneof: text or text_template
Text to synthesize, given in one of the supported text markup forms.
  text string
Raw text (e.g. "Hello, Alice").
  text_template TextTemplate
Text template instance, e.g. "Hello, {username}" with username="Alice".
hints[] Hints
Optional hints for synthesis.
output_audio_spec AudioFormatOptions
Optional. Default: 22050 Hz, linear 16-bit signed little-endian PCM, with a WAV header.
loudness_normalization_type enum LoudnessNormalizationType
Specifies the type of loudness normalization. Optional. Default: LUFS.
  • MAX_PEAK: The type of normalization, wherein the gain is changed to bring the highest PCM sample value or analog signal peak to a given level.
  • LUFS: The type of normalization based on EBU R 128 recommendation.
unsafe_mode bool
Optional. Automatically splits long text into several utterances and bills accordingly. Some degradation in service quality is possible.
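
The Utterance oneof means that a request carries either text or text_template, not both. The following is a minimal Python sketch of that rule, using a plain dictionary as a stand-in for the generated protobuf message. The helper name build_request is hypothetical and not part of the API, and rejecting a request with neither field set is an assumption about what is useful rather than a documented rule.

```python
from typing import Optional


def build_request(text: Optional[str] = None,
                  text_template: Optional[dict] = None,
                  unsafe_mode: bool = False) -> dict:
    """Build a dict shaped like UtteranceSynthesisRequest (illustrative only)."""
    # Utterance oneof: exactly one of text / text_template may be set.
    if (text is None) == (text_template is None):
        raise ValueError("set exactly one of text or text_template")
    # The model field is deliberately left unset, as the reference advises.
    request = {"unsafe_mode": unsafe_mode}
    if text is not None:
        request["text"] = text
    else:
        request["text_template"] = text_template
    return request


request = build_request(text="Hello, Alice")
```

With real generated stubs, the same fields would be set on the message object instead of a dictionary.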

TextTemplate

Field Description
text_template string
Template text.
Sample: The {animal} goes to the {place}.
variables[] TextVariable
Definitions of the variables used in the template text.
Sample: {animal: cat, place: forest}

TextVariable

Field Description
variable_name string
The name of the variable.
variable_value string
The text of the variable.
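
To illustrate how variable_name entries map onto the {placeholders} of a text_template, here is a small local sketch in Python. The substitution itself is performed by the service; the render_template helper and the dataclass below are hypothetical stand-ins, not part of the API.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class TextVariable:
    variable_name: str
    variable_value: str


def render_template(text_template: str, variables: List[TextVariable]) -> str:
    """Substitute each {name} placeholder with the matching variable_value."""
    values = {v.variable_name: v.variable_value for v in variables}
    # A KeyError here means the template names a variable that was not supplied.
    return re.sub(r"\{(\w+)\}", lambda m: values[m.group(1)], text_template)


rendered = render_template(
    "The {animal} goes to the {place}.",
    [TextVariable("animal", "cat"), TextVariable("place", "forest")],
)
print(rendered)  # prints: The cat goes to the forest.
```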

Hints

Field Description
Hint oneof: voice, audio_template, speed, volume, role, pitch_shift or duration
A hint for the TTS engine that specifies characteristics of the synthesized audio.
  voice string
Name of the speaker to use.
  audio_template AudioTemplate
Template for synthesizing.
  speed double
Hint to change speed.
  volume double
Hint to regulate the normalization level.
  • For the MAX_PEAK loudness_normalization_type: volume takes values in the range (0;1]; the default is 0.7.
  • For the LUFS loudness_normalization_type: volume takes values in the range [-145;0); the default is -19.
  role string
Hint to specify the character of the speaker's pronunciation.
  pitch_shift double
Hint to raise or lower the speaker's pitch, measured in Hz. Valid values are in the range [-1000;1000]; the default is 0.
  duration DurationHint
Hint to constrain the audio duration: exact, minimum, or maximum, depending on the DurationHint policy.
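
The documented ranges for volume and pitch_shift can be checked on the client before a request is sent. The sketch below is one way to do that in Python; the function names are hypothetical, and the service may apply its own validation.

```python
def check_volume(volume: float, normalization: str = "LUFS") -> bool:
    """Return True if volume is inside the documented range for the mode."""
    if normalization == "MAX_PEAK":
        return 0 < volume <= 1       # range (0;1]
    if normalization == "LUFS":
        return -145 <= volume < 0    # range [-145;0)
    raise ValueError(f"unknown normalization type: {normalization}")


def check_pitch_shift(pitch_shift_hz: float) -> bool:
    """Return True if the shift is inside the documented range [-1000;1000] Hz."""
    return -1000 <= pitch_shift_hz <= 1000
```

Note that the two volume ranges do not overlap, so a value that is valid for MAX_PEAK is never valid for LUFS and vice versa.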

AudioTemplate

Field Description
audio AudioContent
Audio file.
text_template TextTemplate
Template and description of its variables.
variables[] AudioVariable
Descriptions of the variables in the audio.

AudioContent

Field Description
AudioSource oneof: content
The audio source to read the data from.
  content bytes
Bytes with audio data.
audio_spec AudioFormatOptions
Description of the audio format.

AudioVariable

Field Description
variable_name string
The name of the variable.
variable_start_ms int64
Start time of the variable in milliseconds.
variable_length_ms int64
Length of the variable in milliseconds.
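
Each AudioVariable marks a time span in the template audio, running from variable_start_ms to variable_start_ms + variable_length_ms. The Python sketch below checks that a set of spans lies inside the audio and that the spans do not overlap. Both checks are assumptions about what a sensible template looks like, not documented requirements, and the names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioVariable:
    variable_name: str
    variable_start_ms: int
    variable_length_ms: int

    @property
    def end_ms(self) -> int:
        # End of the span covered by this variable.
        return self.variable_start_ms + self.variable_length_ms


def variables_fit(variables: List[AudioVariable], audio_length_ms: int) -> bool:
    """True if every variable lies inside the audio and no two overlap."""
    previous_end = 0
    for var in sorted(variables, key=lambda v: v.variable_start_ms):
        if var.variable_start_ms < previous_end or var.end_ms > audio_length_ms:
            return False
        previous_end = var.end_ms
    return True
```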

DurationHint

Field Description
policy enum DurationHintPolicy
Type of duration constraint.
  • EXACT_DURATION: Limit audio duration to an exact value.
  • MIN_DURATION: Limit the minimum audio duration.
  • MAX_DURATION: Limit the maximum audio duration.
duration_ms int64
Constraint on audio duration in milliseconds.
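
The three policies can be read as simple comparisons between the resulting audio length and duration_ms. The Python sketch below expresses that reading; it is a hypothetical client-side helper and says nothing about how strictly the engine honors the hint.

```python
def satisfies_duration(policy: str, duration_ms: int, actual_ms: int) -> bool:
    """Check an audio length against a DurationHint-style constraint."""
    if policy == "EXACT_DURATION":
        return actual_ms == duration_ms
    if policy == "MIN_DURATION":
        return actual_ms >= duration_ms
    if policy == "MAX_DURATION":
        return actual_ms <= duration_ms
    raise ValueError(f"unknown policy: {policy}")
```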

AudioFormatOptions

Field Description
AudioFormat oneof: raw_audio or container_audio
  raw_audio RawAudio
The audio format specified in request parameters.
  container_audio ContainerAudio
The audio format specified inside the container metadata.

RawAudio

Field Description
audio_encoding enum AudioEncoding
Encoding type.
  • LINEAR16_PCM: Audio bit depth 16-bit signed little-endian (Linear PCM).
sample_rate_hertz int64
Sampling frequency of the signal.
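
Raw LINEAR16_PCM output has no header, so the client needs sample_rate_hertz to play or store it. The Python sketch below wraps such bytes in a WAV container with the standard wave module and computes the duration. It assumes single-channel (mono) audio, which this reference does not state, and the function names are hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate_hertz: int, channels: int = 1) -> bytes:
    """Wrap 16-bit little-endian PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)           # assumption: mono by default
        wav.setsampwidth(2)                  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate_hertz)
        wav.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_ms(pcm: bytes, sample_rate_hertz: int, channels: int = 1) -> float:
    """Duration of 16-bit PCM audio in milliseconds."""
    return len(pcm) / (2 * channels * sample_rate_hertz) * 1000
```

For example, 44100 bytes of mono audio at 22050 Hz correspond to one second of sound.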

ContainerAudio

Field Description
container_audio_type enum ContainerAudioType
  • WAV: Audio bit depth 16-bit signed little-endian (Linear PCM).
  • OGG_OPUS: Data is encoded using the OPUS audio codec and stored in the OGG container format.
  • MP3: Data is encoded using MPEG-1/2 Layer III and stored in the MP3 container format.

UtteranceSynthesisResponse

Field Description
audio_chunk AudioChunk
Part of synthesized audio.
text_chunk TextChunk
Part of synthesized text.
start_ms int64
Start time of the audio chunk in milliseconds.
length_ms int64
Length of the audio chunk in milliseconds.

AudioChunk

Field Description
data bytes
Sequence of bytes of the synthesized audio in the format specified in output_audio_spec.

TextChunk

Field Description
text string
Synthesized text.
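
Because UtteranceSynthesis returns a stream of UtteranceSynthesisResponse messages, the client typically collects audio_chunk.data from each message. The Python sketch below shows this with dataclass stand-ins and a fake stream in place of the real gRPC iterator. It assumes that chunks arrive in playback order and can simply be concatenated; the names collect_audio and fake_stream are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AudioChunk:
    data: bytes


@dataclass
class UtteranceSynthesisResponse:
    audio_chunk: AudioChunk
    start_ms: int = 0
    length_ms: int = 0


def collect_audio(responses: Iterable[UtteranceSynthesisResponse]) -> bytes:
    """Concatenate audio_chunk.data from a response stream in arrival order."""
    return b"".join(r.audio_chunk.data for r in responses)


# A fake two-message stream stands in for the real gRPC response iterator.
fake_stream = iter([
    UtteranceSynthesisResponse(AudioChunk(b"\x01\x02"), start_ms=0, length_ms=10),
    UtteranceSynthesisResponse(AudioChunk(b"\x03\x04"), start_ms=10, length_ms=10),
])
audio = collect_audio(fake_stream)
```

The resulting bytes are in whatever format output_audio_spec requested, so they can be written to a file as-is when a container format was chosen.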
