Yandex project
© 2025 Yandex.Cloud LLC
Pattern-based speech synthesis

Written by Yandex Cloud
Updated on April 11, 2025
  • Source data for pattern-based synthesis
    • Requirements for pattern recordings
    • Recommendations for audio recording
    • Requirements for synthesized texts
  • How to use pattern-based synthesis

With pattern-based synthesis, you can synthesize speech based on pre-edited phrases, or patterns, in which the changeable key parts are called variables. A new phrase is synthesized as a whole rather than assembled from multiple pre-recorded and synthesized blocks. Speech intonation is copied from the pattern, so the result sounds natural, like real human speech.

Pattern-based synthesis is only available for SpeechKit Brand Voice voices.

Warning

Access to the SpeechKit Brand Voice service is available on request. To access the technology, fill out this form.

Currently, this feature works only for Russian.

Source data for pattern-based synthesis

Each request sent for pattern-based synthesis must include:

  • An audio recording of the pattern phrase.
  • A text template with the variable parts marked up.
  • The values of the variables for synthesizing a new phrase.
  • Timestamps of the beginning and end of each variable part.

Warning

The normalized text of the variable part of the phrase cannot be longer than 25% of the phrase length. The same restriction applies to the duration of the variable part relative to the duration of the final audio.

Suppose you need to synthesize this phrase: Let's check your booking. Your flight is at eleven twenty a.m., September eight. You depart from Heathrow and arrive at Barajas. Is everything correct?
Your pattern audio recording contains this phrase: Let's check your booking. Your flight is at nine p.m., March twenty-three. You depart from Heathrow and arrive at Barajas. Is everything correct?
The text template should then look like this: Let's check your booking. Your flight is at {time}, {date}. You depart from Heathrow and arrive at Barajas. Is everything correct?
List of variables:

  • variable_name = '{time}', variable_value = 'eleven twenty a.m.'
  • variable_name = '{date}', variable_value = 'september eight'

Learn more about examples of pattern-based synthesis.
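
The booking example above can be reproduced locally before sending anything to the service. The following is a minimal Python sketch, not part of the SpeechKit API: the function names `fill_template` and `variable_share` are hypothetical, variable names are written without braces as dictionary keys, and the share is computed on raw characters summed over all variables, which only approximates the 25% rule that the service applies to normalized text.

```python
import re

# Matches placeholders such as {time} or {date} in a text template.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template: str, variables: dict[str, str]) -> str:
    """Replace each {name} placeholder with its value (KeyError if one is missing)."""
    return PLACEHOLDER.sub(lambda m: variables[m.group(1)], template)


def variable_share(template: str, variables: dict[str, str]) -> float:
    """Fraction of the final phrase's characters that come from variable values."""
    phrase = fill_template(template, variables)
    variable_chars = sum(len(variables[name]) for name in PLACEHOLDER.findall(template))
    return variable_chars / len(phrase)


template = (
    "Let's check your booking. Your flight is at {time}, {date}. "
    "You depart from Heathrow and arrive at Barajas. Is everything correct?"
)
variables = {"time": "eleven twenty a.m.", "date": "september eight"}

phrase = fill_template(template, variables)
share = variable_share(template, variables)
print(phrase)
print("within the 25% limit" if share <= 0.25 else "variable part is too long")
```

A check like this is only a pre-flight estimate; the limit on the duration of the variable part can be verified only against the audio itself.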

Requirements for pattern recordings

Requirement                                 Value
Sampling frequency                          48 kHz for SpeechKit Brand Voice Self Service and SpeechKit Brand Voice Premium;
                                            8 kHz or higher for SpeechKit Brand Voice Call Center
Audio bit depth                             16-bit PCM
Number of channels                          1 (mono)
Format                                      WAV
Minimum pattern audio recording duration    1 second

Pattern audio recordings must not include extraneous noise or echo. Minimal post-processing of audio recordings is acceptable. The silence at the beginning and at the end of a recording must not be longer than one second.
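
The header-level requirements in the table can be checked automatically. The following is a minimal sketch using Python's standard `wave` module; the function names `check_pattern_wav` and `make_silent_wav` and the `call_center` flag are hypothetical and exist only for this illustration. It does not measure noise, echo, or leading and trailing silence, which require analyzing the samples themselves.

```python
import io
import wave


def check_pattern_wav(source, call_center: bool = False) -> list[str]:
    """Return a list of violated requirements for a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected 1 channel (mono), got {channels}")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, got {sample_width * 8}-bit")
    if call_center and rate < 8000:
        problems.append(f"expected 8 kHz or higher, got {rate} Hz")
    if not call_center and rate != 48000:
        problems.append(f"expected 48 kHz, got {rate} Hz")
    if duration < 1.0:
        problems.append(f"expected at least 1 second, got {duration:.2f} s")
    return problems


def make_silent_wav(channels: int, rate: int, seconds: float) -> io.BytesIO:
    """Build an in-memory 16-bit PCM WAV filled with silence (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


for problem in check_pattern_wav(make_silent_wav(2, 8000, 0.5)):
    print(problem)
```

Passing `call_center=True` switches the sampling frequency check from exactly 48 kHz to 8 kHz or higher, matching the two rows of the table.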

Recommendations for audio recording

Noise and echo during recording directly affect the quality of training and the performance of your speech synthesis model. For this reason, when recording audio for training and fine-tuning your model, or patterns for synthesis, try to reduce sound reflection in the room. The ideal place for recording is a room equipped with acoustic panels. If you are recording in a household environment, you can reduce the echo with soft surfaces and furniture, such as a carpet or an upholstered sofa.

Use the same equipment, with the same settings and placement, for every audio recording.

Make sure the sound engineer, speaker, and voice coach are different people. A voice coach is a professional who makes sure the speaker strictly follows the text without changing the manner of narration.

When recording any audio intended for your speech synthesis model, follow the one-phrase-per-file rule.

Inaccuracies, micro-repetitions, slips of the tongue, and word substitutions are not allowed. Any speech defects in the recordings made for synthesis models substantially degrade the quality of the synthesized speech.

Each recording of a pattern or phrase for model training must have an absolutely accurate text transcript. Patterns, training phrases, and their transcripts must not contain grammatical errors.

Requirements for synthesized texts

The pattern must match the desired script.

The pattern must contain one or more variables to replace. The length of the variable portion must match the sample in the pattern. If the audio length of variable parts may vary substantially, we recommend recording several patterns with examples of different lengths. The normalized text of the variable part of the phrase cannot be longer than 25% of the pattern length. The same restriction applies to the duration of the variable part relative to the duration of the final audio.

For example, take the pattern Hi, my name is Iris, I work at {company}. If the length of the company variable may vary substantially (e.g., Thunderclouds, Reliable Bank, a regional office of a well-known international company), you will need to record two patterns: one for short values and one for long values. Use the pattern that suits the length of the placeholder value. This way, you can keep your synthesized speech natural.
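
Choosing between a short and a long pattern can be automated. The sketch below is one possible approach, assuming that character count is an acceptable stand-in for spoken length; the `Pattern` class, the `pick_pattern` function, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    audio_path: str    # hypothetical file name of the pattern recording
    sample_value: str  # the variable value actually spoken in that recording


def pick_pattern(patterns: list[Pattern], value: str) -> Pattern:
    """Pick the pattern whose recorded sample is closest in length to the new value."""
    return min(patterns, key=lambda p: abs(len(p.sample_value) - len(value)))


patterns = [
    Pattern("iris_short.wav", "Reliable Bank"),
    Pattern("iris_long.wav", "a regional office of a well-known international company"),
]

print(pick_pattern(patterns, "Thunderclouds").audio_path)
```

With more than two patterns, the same function still returns the closest match, so additional length tiers can be added without changing the selection logic.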

Divide long utterances into independent sentences, each with its own pattern. A synthesized phrase cannot exceed 24 seconds or 250 characters.
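
Splitting a long utterance and flagging phrases that exceed the character limit might look like the following sketch. The splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace), the function names are hypothetical, and the 24-second limit cannot be checked from text alone.

```python
import re

MAX_CHARS = 250  # character limit for one synthesized phrase


def split_into_sentences(text: str) -> list[str]:
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def too_long(sentences: list[str], max_chars: int = MAX_CHARS) -> list[str]:
    """Return the sentences that exceed the character limit."""
    return [sentence for sentence in sentences if len(sentence) > max_chars]


text = (
    "Let's check your booking. Your flight is at nine p.m., March twenty-three. "
    "Is everything correct?"
)
sentences = split_into_sentences(text)
for sentence in sentences:
    print(len(sentence), sentence)
```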

  • When recording Russian speech, make sure to use the letter ё when needed in the pattern text. You cannot use е instead of ё in any synthesized text.

  • In homographs where the stress may be ambiguous, explicitly indicate the stressed vowel with the + sign.

    The +import of goods from foreign countries.: Stress falls on the first syllable.
    Goods they im+port from China.: Stress falls on the second syllable.

  • The lines of text for synthesis cannot include numbers or abbreviations, such as 15 Smith st., bl. 3 or USD 13.10. Make sure to spell out all numbers and abbreviations.

    Fifteen Smith street, block three.
    Thirteen dollars and ten cents.

  • In questions, specify which word should be **logically stressed** to indicate the intonation of an interrogative sentence.

    The sentence Did the cat go to the forest? can be read in three different ways:

    • Did the **cat** go to the forest? Meaning Who went to the forest? Was it really the cat?
    • Did the cat **go** to the forest? Meaning Did the cat walk or run? or Was the action performed? Is the cat gone or not?
    • Did the cat go **to the forest**? Meaning Where did the cat go and why? To the forest, outside, to look for a sausage?

    In all sentences, the logical stress emphasizes the main meaning of the sentence.

Warning

The logical stress should not fall on the variable part of the phrase.
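
Two of the text requirements above lend themselves to a simple automated check: digits in the text and logical stress placed on a variable. The sketch below assumes that logical stress is marked with double asterisks, as in the examples above, and that variables are written as {name}; the function name `check_synthesis_text` is hypothetical. Abbreviations and the ё requirement are not detected and still need a manual review.

```python
import re


def check_synthesis_text(template: str) -> list[str]:
    """Return a list of detected problems in a text template for synthesis."""
    problems = []
    # Numbers must be spelled out, so any digit is a problem.
    if re.search(r"\d", template):
        problems.append("contains digits; spell numbers out")
    # Assumption: logical stress is written as **word**.
    for stressed in re.findall(r"\*\*(.+?)\*\*", template):
        if re.search(r"\{\w+\}", stressed):
            problems.append(f"logical stress falls on a variable: {stressed}")
    return problems


for line in ("USD 13.10", "Is your flight at **{time}**?", "Did the **cat** go to the forest?"):
    print(line, "->", check_synthesis_text(line) or "ok")
```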

How to use pattern-based synthesis

  • API v3:

    • Brand Voice Premium and Brand Voice Self Service.
    • SpeechKit Brand Voice Call Center.
