© 2025 Direct Cursus Technology L.L.C.
In this article:

  • HTTP request
  • Body parameters
  • TextTemplate
  • TextVariable
  • Hints
  • AudioTemplate
  • AudioContent
  • AudioFormatOptions
  • RawAudio
  • ContainerAudio
  • AudioVariable
  • DurationHint
  • Response
  • AudioChunk
  • TextChunk

SpeechKit Synthesis Service API v3, REST: Synthesizer.UtteranceSynthesis

Written by
Yandex Cloud
Updated at February 24, 2025

Synthesizing text into speech.

HTTP request

POST https://tts.api.cloud.yandex.net/tts/v3/utteranceSynthesis
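
A minimal sketch of preparing this request with Python's standard library. The `build_request` helper and the `Authorization: Api-Key ...` header are assumptions made for illustration: this page does not describe authentication, so check the API authentication article for the scheme that applies to your account. The request is only constructed here, not sent.

```python
import json
import urllib.request

URL = "https://tts.api.cloud.yandex.net/tts/v3/utteranceSynthesis"


def build_request(body: dict, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying a JSON body."""
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; not documented on this page.
            "Authorization": f"Api-Key {api_key}",
        },
        method="POST",
    )


req = build_request({"text": "Hello, Alice"}, api_key="<your-api-key>")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without credentials.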

Body parameters

{
  "model": "string",
  // Includes only one of the fields `text`, `textTemplate`
  "text": "string",
  "textTemplate": {
    "textTemplate": "string",
    "variables": [
      {
        "variableName": "string",
        "variableValue": "string"
      }
    ]
  },
  // end of the list of possible fields
  "hints": [
    {
      // Includes only one of the fields `voice`, `audioTemplate`, `speed`, `volume`, `role`, `pitchShift`, `duration`
      "voice": "string",
      "audioTemplate": {
        "audio": {
          // Includes only one of the fields `content`
          "content": "string",
          // end of the list of possible fields
          "audioSpec": {
            // Includes only one of the fields `rawAudio`, `containerAudio`
            "rawAudio": {
              "audioEncoding": "string",
              "sampleRateHertz": "string"
            },
            "containerAudio": {
              "containerAudioType": "string"
            }
            // end of the list of possible fields
          }
        },
        "textTemplate": {
          "textTemplate": "string",
          "variables": [
            {
              "variableName": "string",
              "variableValue": "string"
            }
          ]
        },
        "variables": [
          {
            "variableName": "string",
            "variableStartMs": "string",
            "variableLengthMs": "string"
          }
        ]
      },
      "speed": "string",
      "volume": "string",
      "role": "string",
      "pitchShift": "string",
      "duration": {
        "policy": "string",
        "durationMs": "string"
      }
      // end of the list of possible fields
    }
  ],
  "outputAudioSpec": {
    // Includes only one of the fields `rawAudio`, `containerAudio`
    "rawAudio": {
      "audioEncoding": "string",
      "sampleRateHertz": "string"
    },
    "containerAudio": {
      "containerAudioType": "string"
    }
    // end of the list of possible fields
  },
  "loudnessNormalizationType": "string",
  "unsafeMode": "boolean"
}
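
The `//` comments in the schema above are annotations, not part of the payload, since JSON itself has no comment syntax. The sketch below builds a body that sets one of `text` or `textTemplate` and never both. The `make_body` helper is hypothetical, and treating the pair as "exactly one required" is an assumption drawn from the "includes only one of" wording.

```python
import json
from typing import Optional


def make_body(text: Optional[str] = None,
              text_template: Optional[dict] = None,
              **extra) -> dict:
    """Build a request body with exactly one of `text` / `textTemplate`."""
    if (text is None) == (text_template is None):
        raise ValueError("set exactly one of text or text_template")
    if text is not None:
        body = {"text": text}
    else:
        body = {"textTemplate": text_template}
    body.update(extra)  # optional fields, e.g. hints, outputAudioSpec, unsafeMode
    return body


body = make_body(text="Hello, Alice", unsafeMode=False)
print(json.dumps(body))
```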

Field

Description

model

string

The name of the model, which specifies basic synthesis functionality.
Currently this field must be left empty; do not set it.

text

string

Raw text (e.g. "Hello, Alice").

Includes only one of the fields text, textTemplate.

The text to synthesize, given in one of the text synthesis markups.

textTemplate

TextTemplate

Text template instance, e.g. "Hello, {username}" with username="Alice".

Includes only one of the fields text, textTemplate.

The text to synthesize, given in one of the text synthesis markups.

hints[]

Hints

Optional hints for synthesis.

outputAudioSpec

AudioFormatOptions

Optional. Default: 22050 Hz, linear 16-bit signed little-endian PCM, with a WAV header.

loudnessNormalizationType

enum (LoudnessNormalizationType)

Specifies type of loudness normalization.
Optional. Default: LUFS.

  • LOUDNESS_NORMALIZATION_TYPE_UNSPECIFIED
  • MAX_PEAK: The type of normalization, wherein the gain is changed to bring the highest PCM sample value or analog signal peak to a given level.
  • LUFS: The type of normalization based on EBU R 128 recommendation.

unsafeMode

boolean

Optional. Automatically splits long text into several utterances and bills accordingly. Some degradation in service quality is possible.

TextTemplate

Field

Description

textTemplate

string

Template text.

Sample: The {animal} goes to the {place}.

variables[]

TextVariable

Defines the variables used in the template text.

Sample: {animal: cat, place: forest}

TextVariable

Field

Description

variableName

string

The name of the variable.

variableValue

string

The text of the variable.
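
As an illustration of how TextTemplate and TextVariable fit together, the sketch below builds a `textTemplate` object from the sample above and previews the substitution locally. Both helpers are hypothetical, and passing variable names without braces is an assumption based on the `{animal: cat, place: forest}` sample. The preview only mimics the substitution; the service performs the real one.

```python
def make_text_template(template: str, values: dict) -> dict:
    """Build a `textTemplate` object from a template and variable values."""
    return {
        "textTemplate": template,
        "variables": [
            {"variableName": name, "variableValue": value}
            for name, value in values.items()
        ],
    }


def preview(text_template: dict) -> str:
    """Local preview only: replace each {name} placeholder with its value."""
    text = text_template["textTemplate"]
    for var in text_template["variables"]:
        text = text.replace("{" + var["variableName"] + "}", var["variableValue"])
    return text


tpl = make_text_template("The {animal} goes to the {place}.",
                         {"animal": "cat", "place": "forest"})
print(preview(tpl))  # The cat goes to the forest.
```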

Hints

Field

Description

voice

string

Name of speaker to use.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.

audioTemplate

AudioTemplate

Template for synthesizing.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.

speed

string

Hint to change speed.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.

volume

string

Hint to regulate normalization level.

  • When loudnessNormalizationType is MAX_PEAK: volume takes values in the range (0;1], default value is 0.7.
  • When loudnessNormalizationType is LUFS: volume takes values in the range [-145;0), default value is -19.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.

role

string

Hint to specify pronunciation character for the speaker.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.

pitchShift

string

Hint to increase (or decrease) speaker's pitch, measured in Hz. Valid values are in range [-1000;1000], default value is 0.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.

duration

DurationHint

Hint to limit both minimum and maximum audio duration.

Includes only one of the fields voice, audioTemplate, speed, volume, role, pitchShift, duration.

The hint for TTS engine to specify synthesised audio characteristics.
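
Because each Hints object includes only one of its fields, several hints are expressed as several single-field objects in `hints[]`. The sketch below shows one way to assemble such a list, checking the volume and pitch ranges stated above. The helper names are hypothetical, `<voice-name>` is a placeholder rather than a real voice, and numeric values are sent as strings only because the schema above lists these fields as strings.

```python
ALLOWED_HINTS = {"voice", "audioTemplate", "speed", "volume",
                 "role", "pitchShift", "duration"}


def hint(name: str, value) -> dict:
    """Return one hint object carrying a single field."""
    if name not in ALLOWED_HINTS:
        raise ValueError(f"unknown hint field: {name}")
    return {name: value}


def check_volume(volume: float, normalization: str = "LUFS") -> float:
    """Validate volume against the documented range for the normalization type."""
    if normalization == "MAX_PEAK":
        ok = 0 < volume <= 1        # range (0;1]
    elif normalization == "LUFS":
        ok = -145 <= volume < 0     # range [-145;0)
    else:
        raise ValueError(f"unsupported normalization: {normalization}")
    if not ok:
        raise ValueError(f"volume {volume} out of range for {normalization}")
    return volume


def check_pitch_shift(hz: float) -> float:
    """Validate the pitch shift against the documented range [-1000;1000] Hz."""
    if not -1000 <= hz <= 1000:
        raise ValueError(f"pitchShift {hz} out of range")
    return hz


hints = [
    hint("voice", "<voice-name>"),
    hint("volume", str(check_volume(-19, "LUFS"))),
    hint("pitchShift", str(check_pitch_shift(50))),
]
print(hints)
```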

AudioTemplate

Field

Description

audio

AudioContent

Audio file.

textTemplate

TextTemplate

Template and description of its variables.

variables[]

AudioVariable

Describes the variables in the audio.

AudioContent

Field

Description

content

string (bytes)

Bytes with audio data.

Includes only the field content.

The audio source to read the data from.

audioSpec

AudioFormatOptions

Description of the audio format.

AudioFormatOptions

Field

Description

rawAudio

RawAudio

The audio format specified in request parameters.

Includes only one of the fields rawAudio, containerAudio.

containerAudio

ContainerAudio

The audio format specified inside the container metadata.

Includes only one of the fields rawAudio, containerAudio.

RawAudio

Field

Description

audioEncoding

enum (AudioEncoding)

Encoding type.

  • AUDIO_ENCODING_UNSPECIFIED
  • LINEAR16_PCM: Audio bit depth 16-bit signed little-endian (Linear PCM).

sampleRateHertz

string (int64)

Sampling frequency of the signal.

ContainerAudio

Field

Description

containerAudioType

enum (ContainerAudioType)

  • CONTAINER_AUDIO_TYPE_UNSPECIFIED
  • WAV: Audio bit depth 16-bit signed little-endian (Linear PCM).
  • OGG_OPUS: Data is encoded using the OPUS audio codec and compressed using the OGG container format.
  • MP3: Data is encoded using MPEG-1/2 Layer III and compressed using the MP3 container format.
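
AudioFormatOptions carries either `rawAudio` or `containerAudio`, never both. A minimal sketch of two hypothetical helpers that each return one of the variants, suitable as the value of `outputAudioSpec`. The int64 sample rate is serialized as a string, matching the schema above; which sample rates the service accepts is not stated on this page, so 22050 is used only because it appears as the default.

```python
CONTAINER_TYPES = {"WAV", "OGG_OPUS", "MP3"}


def raw_audio(sample_rate_hertz: int) -> dict:
    """Audio format variant for headerless 16-bit little-endian PCM."""
    return {
        "rawAudio": {
            "audioEncoding": "LINEAR16_PCM",
            "sampleRateHertz": str(sample_rate_hertz),  # int64 sent as a string
        }
    }


def container_audio(container_type: str) -> dict:
    """Audio format variant for a container (WAV, OGG_OPUS or MP3)."""
    if container_type not in CONTAINER_TYPES:
        raise ValueError(f"unknown container type: {container_type}")
    return {"containerAudio": {"containerAudioType": container_type}}


print(raw_audio(22050))
print(container_audio("OGG_OPUS"))
```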

AudioVariable

Field

Description

variableName

string

The name of the variable.

variableStartMs

string (int64)

Start time of the variable in milliseconds.

variableLengthMs

string (int64)

Length of the variable in milliseconds.
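
To show how AudioTemplate, AudioContent and AudioVariable nest, here is a sketch that assembles an `audioTemplate` hint value. The `make_audio_template` helper and the sample timings are hypothetical. Encoding the `content` bytes as base64 is an assumption based on the usual JSON representation of byte fields, and the dummy bytes stand in for a real WAV recording.

```python
import base64


def make_audio_template(audio_bytes: bytes, text_template: dict, spans: dict) -> dict:
    """Assemble an audioTemplate; `spans` maps variable name -> (start_ms, length_ms)."""
    return {
        "audio": {
            # Assumed: byte fields travel as base64 text in JSON.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
            "audioSpec": {"containerAudio": {"containerAudioType": "WAV"}},
        },
        "textTemplate": text_template,
        "variables": [
            {
                "variableName": name,
                "variableStartMs": str(start_ms),    # int64 sent as a string
                "variableLengthMs": str(length_ms),
            }
            for name, (start_ms, length_ms) in spans.items()
        ],
    }


template = make_audio_template(
    b"dummy-audio-bytes",  # placeholder, not a valid WAV file
    {"textTemplate": "Hello, {username}",
     "variables": [{"variableName": "username", "variableValue": "Alice"}]},
    {"username": (700, 450)},  # made-up position of the variable in the recording
)
print(template["variables"])
```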

DurationHint

Field

Description

policy

enum (DurationHintPolicy)

Type of duration constraint.

  • DURATION_HINT_POLICY_UNSPECIFIED
  • EXACT_DURATION: Limit audio duration to exact value.
  • MIN_DURATION: Limit the minimum audio duration.
  • MAX_DURATION: Limit the maximum audio duration.

durationMs

string (int64)

Constraint on audio duration in milliseconds.
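
A short sketch of a `duration` hint built from the policy values listed above. The `duration_hint` helper is hypothetical and validates only the policy name; the page states no bounds for `durationMs`, so none are enforced here.

```python
DURATION_POLICIES = {"EXACT_DURATION", "MIN_DURATION", "MAX_DURATION"}


def duration_hint(policy: str, duration_ms: int) -> dict:
    """Return a single-field hint object limiting the audio duration."""
    if policy not in DURATION_POLICIES:
        raise ValueError(f"unknown duration policy: {policy}")
    return {"duration": {"policy": policy, "durationMs": str(duration_ms)}}


print(duration_hint("MAX_DURATION", 3000))
```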

Response

HTTP Code: 200 - OK

{
  "audioChunk": {
    "data": "string"
  },
  "textChunk": {
    "text": "string"
  },
  "startMs": "string",
  "lengthMs": "string"
}

Field

Description

audioChunk

AudioChunk

Part of synthesized audio.

textChunk

TextChunk

Part of synthesized text.

startMs

string (int64)

Start time of the audio chunk in milliseconds.

lengthMs

string (int64)

Length of the audio chunk in milliseconds.

AudioChunk

Field

Description

data

string (bytes)

Sequence of bytes of the synthesized audio in the format specified in outputAudioSpec.

TextChunk

Field

Description

text

string

Synthesized text.
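
A minimal sketch of unpacking one response object of the shape shown above. The `decode_chunk` helper is hypothetical, and the sample object is fabricated for the example rather than taken from a real reply. Decoding `audioChunk.data` from base64 is an assumption based on the usual JSON representation of byte fields, and whether the endpoint returns one object or several is not covered on this page.

```python
import base64
import json


def decode_chunk(obj: dict) -> tuple:
    """Return (audio_bytes, text, start_ms, length_ms) from one response object."""
    audio = base64.b64decode(obj.get("audioChunk", {}).get("data", ""))
    text = obj.get("textChunk", {}).get("text", "")
    start_ms = int(obj.get("startMs", "0"))    # int64 fields arrive as strings
    length_ms = int(obj.get("lengthMs", "0"))
    return audio, text, start_ms, length_ms


# Fabricated sample with three placeholder bytes instead of real audio.
sample = json.dumps({
    "audioChunk": {"data": base64.b64encode(b"\x00\x01\x02").decode("ascii")},
    "textChunk": {"text": "Hello, Alice"},
    "startMs": "0",
    "lengthMs": "120",
})

audio, text, start_ms, length_ms = decode_chunk(json.loads(sample))
print(len(audio), text, start_ms, length_ms)
```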
