
Speech synthesis

Written by
Yandex Cloud
Updated at April 30, 2025
  • Synthesis options
  • Languages and voices
    • Role
    • Voice tone
  • Controlling pronunciation
  • Synthesis settings
    • Synthesized speech speed
    • Volume normalization
    • Synthesized audio file format
  • Use cases

Speech synthesis in Yandex SpeechKit allows you to convert any text to speech in multiple languages.

SpeechKit voice models are built on deep neural networks. When synthesizing speech, the model reproduces many details of the original voice. It evaluates the entire text, not individual sentences, before starting synthesis. This allows the synthesized voice to sound clear and natural, without electronic distortion, and to reproduce the intonations of real human speech.

The service is available at tts.api.cloud.yandex.net:443.

To try out the Text-to-Speech and Speech-to-Text product demos, visit the SpeechKit page on our website.

Synthesis options

You can work with SpeechKit via the API or Playground. For more information about working with the Yandex Cloud API, see API concepts.

SpeechKit synthesis provides two APIs: API v1 (REST) and API v3 (gRPC). The SpeechKit Python SDK is built on top of API v3.

Feature comparison between API v1 and API v3:

  • Specification: REST (API v1); gRPC, REST (API v3).
  • Selecting voice: voice parameter (API v1); hints: voice parameter (API v3).
  • Selecting language: depends on the voice, lang parameter (API v1); depends on the voice, not specified explicitly in the request (API v3).
  • Specifying role: depends on the voice, emotion parameter (API v1); depends on the voice, hints: role parameter (API v3).
  • Voice tone management: not supported (API v1); hints: pitch_shift parameter (API v3).
  • Controlling pronunciation: SSML and TTS markup (API v1); TTS markup (API v3).
  • Pronunciation speed: speed parameter (API v1); hints: speed parameter (API v3).
  • Adjusting volume: not supported (API v1); loudness_normalization_type parameter (API v3).
  • Output audio format: format parameter (API v1); output_audio_spec parameter (API v3).
  • Specifying LPCM parameters: sampleRateHertz parameter (API v1); output_audio_spec: raw_audio parameter (API v3).
  • Audio pattern-based synthesis: not supported (API v1); text_template parameter (API v3).
  • Pricing method: total number of characters in the requests (API v1); per request (API v3).
  • Automatic splitting of long phrases: not required (API v1); unsafe_mode parameter (API v3).
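As a sketch of how the API v1 parameters above fit together, the helper below assembles a tts:synthesize request; the endpoint and parameter names (text, lang, voice, format, folderId) follow the API v1 reference, while the default values are illustrative only:

```python
# Sketch of an API v1 (REST) synthesis request. Assumes you already have an
# IAM token and a folder ID; defaults below are examples, not recommendations.

def build_tts_v1_request(text, iam_token, folder_id,
                         voice="jane", lang="ru-RU", fmt="oggopus"):
    """Build the URL, headers, and form data for a tts:synthesize call."""
    url = "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize"
    headers = {"Authorization": f"Bearer {iam_token}"}
    data = {
        "text": text,          # plain text or TTS-marked-up text
        "lang": lang,          # language code, e.g. ru-RU or en-US
        "voice": voice,        # a voice from "List of voices"
        "format": fmt,         # output audio format
        "folderId": folder_id,
    }
    return url, headers, data

# Sending it (requires the `requests` package and valid credentials):
# import requests
# url, headers, data = build_tts_v1_request("Hello!", token, folder)
# audio_bytes = requests.post(url, headers=headers, data=data).content
```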

Note

A single request to the SpeechKit API v3 may return multiple responses, each containing an audio fragment. The complete result is produced by merging all the fragments received.
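The merging step can be sketched as a simple concatenation of the fragments in arrival order; how you extract the raw bytes from each API v3 response depends on the client you use:

```python
# Minimal sketch: assemble the complete audio from streamed API v3 fragments.
# `chunks` is any iterable of byte strings, one per received response.

def merge_fragments(chunks):
    """Concatenate audio fragments in arrival order into one byte string."""
    return b"".join(chunks)
```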

Languages and voices

You can select a voice to convert your text to speech. Each voice corresponds to a model trained on the speaker's speech pattern. Voices differ by tone, gender, and language. For a list of voices and their specifications, see List of voices.

If no voice suits your business, SpeechKit can create a unique one specifically for you. For more information, see Yandex SpeechKit Brand Voice.

SpeechKit can synthesize speech in different languages. Each voice is designed to synthesize speech in a specific language. The voices can also read text in another language, but the quality of the synthesized speech will be worse in this case, as the speaker will pronounce the text with an accent, and there might be errors in word synthesis.

Role

Synthesized speech sounds different depending on the selected role. A role is a manner of pronunciation for the same speaker. Different voices support different sets of roles. Attempting to use a role the selected voice does not support causes a service error.
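Since an unsupported role triggers a server-side error, a client can fail fast with a local check. The voice-to-role mapping below is purely illustrative; consult List of voices for the real sets:

```python
# Hypothetical per-voice role sets, for illustration only.
SUPPORTED_ROLES = {
    "jane": {"neutral", "good", "evil"},  # example values
    "filipp": set(),                       # example of a voice with no roles
}

def check_role(voice, role):
    """Raise before sending a request that the service would reject anyway."""
    roles = SUPPORTED_ROLES.get(voice, set())
    if role is not None and role not in roles:
        raise ValueError(f"voice {voice!r} does not support role {role!r}")
```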

Voice tone

Each SpeechKit voice has a certain pitch. In API v3, you can change the voice's tone by specifying a shift from its base pitch. Set the shift in the hints: pitch_shift parameter, in the range [-1000; 1000] Hz; the default value is 0. Positive hints: pitch_shift values make the voice higher, negative ones make it lower.
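A client-side guard for the documented range can catch mistakes before the request is sent; the bounds below come straight from the text:

```python
# pitch_shift must lie in [-1000; 1000] Hz; 0 leaves the base tone unchanged.

def validate_pitch_shift(shift_hz: float) -> float:
    """Return the shift unchanged, or raise if it is out of range."""
    if not -1000 <= shift_hz <= 1000:
        raise ValueError("pitch_shift must be within [-1000; 1000] Hz")
    return shift_hz
```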

Controlling pronunciation

To have control over pronunciation in synthesized speech, mark up the source text explicitly. SpeechKit can synthesize speech from text marked up using Speech Synthesis Markup Language (SSML) or TTS markup. These markup methods enable you to set the length of pauses, the pronunciation of individual sounds, and more. SSML and TTS markup have different data transmission parameters:

  • SSML is only supported in API v1 requests. To transmit text in SSML format, include the ssml parameter in the request body and use the <speak> tag as a wrapper for the text. For more information about SSML tags, see SSML markup.
  • TTS markup is supported in the API v1 and API v3. In API v1 requests, transmit the text marked up according to TTS rules in the text parameter in the request body. API v3 and the Python SDK require no special parameters and consider any transmitted text as marked up according to TTS rules. For more information about using TTS markup, see TTS markup.
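For the SSML path described above, the text must be wrapped in a <speak> tag before being placed in the ssml request parameter. A minimal sketch, using the standard SSML <break> tag for a trailing pause:

```python
# Wrap plain text for the API v1 `ssml` parameter. The <speak> wrapper and
# <break time="..."/> tag are standard SSML; see "SSML markup" for the full
# set of supported tags.

def to_ssml(text, pause_ms=None):
    """Return SSML-wrapped text, optionally appending a pause."""
    body = text
    if pause_ms is not None:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"
```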

Warning

When using pattern-based synthesis, the markup outside the variable part is ignored.

Synthesis settings

You can configure both pronunciation and technical characteristics of the synthesized speech.

Synthesized speech speed

The speed of synthesized speech affects how well the information is perceived. Speech that is too fast or too slow sounds unnatural; however, a higher rate can be useful in commercials, where every second of airtime counts.

By default, the speed of generated speech corresponds to the average speed of human speech.

Volume normalization

In API v3 and Python SDK requests, you can set the type and level of volume normalization. This is useful when SpeechKit synthesis is combined with other sound sources, for example, to keep a voice assistant's volume consistent with phone notifications.

SpeechKit supports two normalization types:

  • MAX_PEAK: peak normalization, which raises the audio level to the maximum distortion-free value attainable in digital audio.
  • LUFS: weighted normalization based on the EBU R 128 standard, where volume is normalized relative to the full digital scale.

The normalization type is set in the loudness_normalization_type parameter. By default, SpeechKit uses LUFS.

The normalization level is set in the hints: volume parameter. Possible values depend on the normalization type:

  • For MAX_PEAK, the parameter can be set in the (0;1] range, the default value is 0.7.
  • For LUFS, the parameter varies in the [-149;0) range, the default value is -19.

If the normalization level falls outside the supported range, the SpeechKit server will return the InvalidArgument error.
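The type-dependent ranges and defaults above can be mirrored client-side to avoid an InvalidArgument round trip; the numbers in this sketch come straight from the text:

```python
# Normalization level ranges per the docs: (0; 1] for MAX_PEAK (default 0.7),
# [-149; 0) for LUFS (default -19). Out-of-range values make the SpeechKit
# server return InvalidArgument.

DEFAULT_LEVELS = {"MAX_PEAK": 0.7, "LUFS": -19}

def normalization_level(norm_type="LUFS", level=None):
    """Return a valid level for the given type, or raise on bad input."""
    if level is None:
        return DEFAULT_LEVELS[norm_type]
    if norm_type == "MAX_PEAK" and 0 < level <= 1:
        return level
    if norm_type == "LUFS" and -149 <= level < 0:
        return level
    raise ValueError("normalization level outside the supported range")
```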

Synthesized audio file format

You can select the format of the audio file that SpeechKit produces during synthesis.

For a full list of available formats and their specifications, see Supported audio formats.

Use cases

  • Developing a Telegram bot for text recognition in images, audio synthesis and recognition
  • Using Yandex API Gateway to set up speech synthesis in Yandex SpeechKit
  • Speech synthesis in the API v3
  • Speech synthesis in the REST API v3
  • Speech synthesis in OggOpus format using the API v1
  • Speech synthesis in WAV format using the API v1
  • Speech synthesis from SSML text using the API v1

See also

  • Try speech synthesis using our demo on the service page.
  • Review the parameters of the API v1 method and API v3 requests.

Yandex project
© 2025 Yandex.Cloud LLC