© 2025 Direct Cursus Technology L.L.C.

Data for SpeechKit Brand Voice Lite

Written by
Yandex Cloud
Updated at September 26, 2025
  • Data required for training
  • Text requirements
  • Audio recording recommendations
    • Requirements for audio recordings
  • How to create a voice
  • Voice statuses
  • How to access a voice in a fine-tuned model

SpeechKit Brand Voice Lite lets you create a unique voice from a small set of audio samples paired with their texts. Training produces a model ID that your applications can use via the API.

Data required for training

To create a SpeechKit Brand Voice Lite voice of your own, you need to prepare a training dataset consisting of audio recordings and the texts they are based on. Each audio recording must exactly match the text that comes with it.

The size of the training dataset depends on the speaker's articulation and the purpose of the new voice. For example, a voice used to narrate literary fiction would require more data than a voice for a virtual assistant. You need at least 30 minutes of audio, not counting the silence at the beginning and end of each file. The recommended duration is 60 minutes. For best results, use 90 minutes of audio or more. To create a voice with multiple roles, you will need a separate dataset for each one.
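The duration thresholds above can be checked before you upload anything. The following is a minimal sketch, not an official tool: the function names, the tier labels, and the `edge_silence_s` parameter (an estimate of the silence at each end of a file) are assumptions made for illustration.

```python
from typing import Iterable

# Thresholds taken from the text above, in minutes of audio.
MIN_MINUTES = 30
RECOMMENDED_MINUTES = 60
BEST_MINUTES = 90


def speech_minutes(durations_s: Iterable[float], edge_silence_s: float = 0.0) -> float:
    """Return total minutes of audio, subtracting an assumed amount of
    silence from the beginning and end of every file."""
    total_s = sum(max(d - 2 * edge_silence_s, 0.0) for d in durations_s)
    return total_s / 60.0


def dataset_tier(minutes: float) -> str:
    """Map a total duration to a hypothetical tier label."""
    if minutes < MIN_MINUTES:
        return "insufficient"
    if minutes < RECOMMENDED_MINUTES:
        return "minimum"
    if minutes < BEST_MINUTES:
        return "recommended"
    return "best"


if __name__ == "__main__":
    # 400 recordings of 10 seconds each, with about 0.15 s of silence per edge.
    minutes = speech_minutes([10.0] * 400, edge_silence_s=0.15)
    print(f"{minutes:.1f} min of speech: {dataset_tier(minutes)}")
```

If you create a voice with multiple roles, run the check separately for each role's dataset.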

Tip

The result of SpeechKit Brand Voice Lite voice model training directly depends on training data. To get a high-quality SpeechKit Brand Voice Lite voice, adhere to all the recommendations we give for texts and audio.

At least 30% of the training data should contain questions so that the trained voice can reproduce the interrogative tone in texts. You can use ready-made phrases from the SpeechKit team or prepare your own texts. For tips on creating and formatting training texts, see Text requirements.
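To estimate the share of questions in your own texts, you could use a sketch like the one below. It assumes that the share is counted per phrase and that a phrase counts as a question if it contains a question mark; neither assumption comes from the SpeechKit documentation, and the function names are hypothetical.

```python
from typing import Sequence

REQUIRED_QUESTION_SHARE = 0.30  # "at least 30%" from the text above


def question_share(phrases: Sequence[str]) -> float:
    """Return the fraction of non-empty phrases that contain a question mark."""
    cleaned = [p.strip() for p in phrases if p.strip()]
    if not cleaned:
        return 0.0
    questions = sum(1 for p in cleaned if "?" in p)
    return questions / len(cleaned)


def meets_question_quota(phrases: Sequence[str]) -> bool:
    """Check the phrase list against the 30% guideline."""
    return question_share(phrases) >= REQUIRED_QUESTION_SHARE


if __name__ == "__main__":
    texts = [
        "Did the cat go to the forest?",
        "We have an amazing book for you!",
        "Books are a uniquely portable magic.",
    ]
    print(f"{question_share(texts):.0%} questions, ok={meets_question_quota(texts)}")
```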

You can upload a dataset with all audio recordings and transcripts as a single archive, upload audio files one by one, or record audio for each phrase in the management console using your browser. In any case, make sure you follow the audio recording recommendations and listen to all your recordings to ensure high quality of source data for voice model training. Once you start training, you will no longer be able to edit your dataset.

Text requirements

The SpeechKit team provides a ready-made set of texts that you only need to read aloud. It contains narrative, interrogative, and exclamatory phrases you would need in most cases to create voices in Russian. You can also prepare your own texts for audio, but make sure they meet the following requirements:

  • Your texts must be free of grammatical errors.
  • The recommended phrase length is no more than 250 characters.
  • There must not be any incomplete phrases. For example, the following text is cut off at both ends and is not suitable:

    ing time! He won't be happy about that. If you hadn't had that argument, you could have asked for anything you wa.

  • When recording Russian speech, make sure the text uses the letter ё wherever it is required. You cannot use е instead of ё in any text for synthesis.

  • In homographs where the stress may be ambiguous, explicitly indicate the stressed vowel with the + sign.

    The +import of goods from foreign countries. (Stress falls on the first syllable.)
    Goods they im+port from China. (Stress falls on the second syllable.)

  • The lines of text for synthesis cannot include numbers or abbreviations, such as 15 Smith st., bl. 3 or USD 13.10. Make sure to spell out all numbers and abbreviations.

    Fifteen Smith street, block three.
    Thirteen dollars and ten cents.

  • In questions, specify which word should be **logically stressed** to indicate the intonation of an interrogative sentence.

    The sentence "Did the cat go to the forest?" can be read in three different ways:

    • Did the **cat** go to the forest? Meaning: "Who went to the forest? Was it really the cat?"
    • Did the cat **go** to the forest? Meaning: "Did the cat walk or run?" or "Was the action performed? Is the cat gone or not?"
    • Did the cat go **to the forest**? Meaning: "Where did the cat go and why? To the forest, outside, to look for a sausage?"

    In all sentences, the logical stress emphasizes the main meaning of the sentence.
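Some of the text requirements above can be pre-checked automatically. The sketch below is only a heuristic and assumes nothing about how SpeechKit itself validates texts: `check_phrase` is a hypothetical helper, and its incompleteness checks (terminal punctuation, lowercase first letter) are rough approximations. It does not detect abbreviations, grammatical errors, or missing stress marks.

```python
import re

MAX_CHARS = 250                     # recommended phrase length from the list above
_DIGIT = re.compile(r"\d")
_TERMINAL = (".", "!", "?", "…")   # assumed set of sentence-final characters


def check_phrase(text: str) -> list[str]:
    """Return a list of human-readable problems found in one phrase."""
    problems = []
    stripped = text.strip()
    if len(stripped) > MAX_CHARS:
        problems.append("longer than 250 characters")
    if _DIGIT.search(stripped):
        problems.append("contains digits; spell numbers out")
    if stripped and not stripped.endswith(_TERMINAL):
        problems.append("no terminal punctuation; phrase may be incomplete")
    if stripped and stripped[0].isalpha() and stripped[0].islower():
        problems.append("starts with a lowercase letter; phrase may be incomplete")
    return problems


if __name__ == "__main__":
    for phrase in ("Fifteen Smith street, block three.", "15 Smith st., bl. 3"):
        print(repr(phrase), "->", check_phrase(phrase) or "ok")
```

A phrase that passes these checks still needs a manual review.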

Audio recording recommendations

Record in a quiet room without background noise. Fan, AC unit, fridge, street noise, background music, echo – all these will get into the recording and strongly affect the quality of your new voice. A room lined with acoustic panels would be the perfect place for your recording session. If recording at home, you can reduce echo with soft surfaces and furniture, such as a carpet, upholstered sofa, etc.

You do not have to use professional equipment to record audio. If you do not have a studio microphone, use the built-in microphone of your laptop, a smartphone, or a headset. The noise reduction feature of most headsets cuts off the beginning and end of the phrase; therefore, we recommend turning it off. Do not change the hardware settings until you finish your recording. The distance to the microphone should be the same for all your recordings.

Where possible, keep the device you are recording with unplugged from the power supply: you may get acoustic interference while the battery is charging. Disable sound notifications and switch your smartphone to airplane mode to avoid unwanted sounds in your recording.

Record using maximum quality without compression. If using special sound recording software, make sure that file compression and extra filters are off.

Before you begin, make sure you are not hungry or thirsty and that your mouth is neither dry nor producing excess saliva. Remove all rustling objects and try not to move your hands and legs too often during the session. Relax, straighten your back and neck, and breathe deeply and freely. The sound of your breathing must not drown out the speech.

Make the text sound as smooth and natural as possible. Your emotions should be aligned with the text you are reading. Keep your intonation the same for all samples within a particular role. For detailed recommendations on recording roles, see Tips for recording roles.

If you make a mistake in pronunciation, redo the whole phrase; do not try to correct yourself in the same audio. No inaccuracies, micro repeats, slips of the tongue, or word substitutions are allowed. Any articulation defects in the recordings the synthesis model is trained on greatly degrade the quality of synthesized speech.

After you record a phrase, listen to the output audio. Make sure all words are clear and distinct, there are small stretches of silence at the beginning and end of the audio, and there is no background noise.

Requirements for audio recordings

If you are using special sound recording software instead of the management console in your browser, make sure your recordings meet the following criteria:

| Requirement | Value |
| --- | --- |
| Sampling frequency | 48 kHz |
| Audio bit depth | 16-bit PCM |
| Number of channels | 1 (mono) |
| Format | WAV |
| Duration | ≤ 15 seconds |
| Silence intervals at the beginning and end | 100–200 ms |

Warning

Each recording should contain a full phrase of one or several sentences. A recording cannot be longer than 15 seconds. The transcripts in the table must exactly match the text in the audio.

Each audio must start and end with an interval of silence; there must be no clipped sounds or words. You cannot simply take a podcast recording and cut it into 15-second pieces. If you do so, the segments will have their boundaries in the middle of words or phrases and will not be aligned with logical phrases. You cannot train a high-quality model on such data.
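The format requirements for recordings (48 kHz, 16-bit PCM, mono, WAV, at most 15 seconds) can be verified with Python's standard `wave` module. This is a minimal sketch: `check_wav` is a hypothetical helper name, and the check does not cover silence intervals or transcript accuracy.

```python
import wave

EXPECTED_RATE_HZ = 48_000        # sampling frequency
EXPECTED_SAMPLE_WIDTH = 2        # bytes per sample, i.e. 16 bit
EXPECTED_CHANNELS = 1            # mono
MAX_DURATION_S = 15.0


def check_wav(path: str) -> list[str]:
    """Return a list of format problems for one WAV file (empty if none).

    wave.open raises wave.Error if the file is not a WAV file it can parse.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate != EXPECTED_RATE_HZ:
            problems.append(f"sampling frequency is {rate} Hz, expected 48000 Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"bit depth is {8 * wav.getsampwidth()} bit, expected 16 bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
        duration_s = wav.getnframes() / rate if rate else 0.0
        if duration_s > MAX_DURATION_S:
            problems.append(f"duration is {duration_s:.2f} s, expected at most 15 s")
    return problems


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        print(wav_path, check_wav(wav_path) or "ok")
```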

Example of a well-prepared audio: short intervals of silence at the beginning and end of the audio, and the phrase is complete.

[Image: good-audio]

Example of a poorly made audio: no silence intervals at the beginning and end of the audio, the start and end of the phrase are cut off.

[Image: bad-audio]

Such audio recordings will have distortions and are not suitable for model training.
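To spot recordings with clipped edges, you can measure how much silence each file has at its beginning and end. The sketch below assumes 16-bit mono PCM input and treats any sample whose absolute amplitude is at or below `threshold` as silence; the threshold value of 500 and the function name are arbitrary assumptions that you would need to tune for your microphone and room.

```python
import array
import sys
import wave


def edge_silence_ms(path: str, threshold: int = 500) -> tuple[float, float]:
    """Return (leading, trailing) silence in milliseconds for a 16-bit mono WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Indexes of all samples that are louder than the silence threshold.
    loud = [i for i, value in enumerate(samples) if abs(value) > threshold]
    if not loud:
        total_ms = 1000.0 * len(samples) / rate
        return total_ms, total_ms  # the whole file is silent

    leading_ms = 1000.0 * loud[0] / rate
    trailing_ms = 1000.0 * (len(samples) - 1 - loud[-1]) / rate
    return leading_ms, trailing_ms


if __name__ == "__main__":
    for wav_path in sys.argv[1:]:
        lead, trail = edge_silence_ms(wav_path)
        print(f"{wav_path}: {lead:.0f} ms before speech, {trail:.0f} ms after")
```

Values far below the 100–200 ms range suggest that the beginning or end of the phrase may be cut off, so listen to such files again before uploading them.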

How to create a voice

Recording audio via the browser

  1. In the management console, select the folder you are going to use to work with SpeechKit.
  2. In the list of services, select SpeechKit.
  3. In the left-hand panel, click Brand Voice Lite.
  4. Click Create voice.
  5. Click Create a dataset and select Record audio files.
  6. Select the data you want to record: click Use a ready-made template or Upload your own texts.
  7. Click Record audio ⟶ Record next to the text and then read the text aloud.
  8. Click to listen to the result.
  9. If your recording contains no background noise, the phrase was recorded fully, and you can hear it well, click Save. To rerecord, click New record.
  10. Record audio for the remaining texts.
  11. Under Use of voice:

    1. Read and confirm having read the SpeechKit Terms of Use.
    2. Confirm that you have written consent from the person whose voice is used in the audio to use this audio for creating and using speech synthesis.
  12. Click Create voice to start creating a voice.

Uploading audio files

  1. In the management console, select the folder you are going to use to work with SpeechKit.
  2. In the list of services, select SpeechKit.
  3. In the left-hand panel, click Brand Voice Lite.
  4. Click Create voice.
  5. Click Create a dataset and select Add audio files.
  6. Select the texts used to record your audio files: click Use a ready-made template or Upload your own texts.
  7. In the row with the text, click Add audio file.
  8. Add audio files for the remaining texts.
  9. Under Use of voice:

    1. Read and confirm having read the SpeechKit Terms of Use.
    2. Confirm that you have written consent from the person whose voice is used in the audio to use this audio for creating and using speech synthesis.
  10. Click Create voice to start creating a voice.

Uploading a ZIP archive

  1. In the management console, select the folder you are going to use to work with SpeechKit.
  2. In the list of services, select SpeechKit.
  3. Click Create voice.
  4. Click Create a dataset and select Download ZIP archive.
  5. Drag a ZIP archive to the upload area.

    A ZIP archive with a fine-tuning dataset must contain:

    • Audio recordings in WAV format.
    • UTF-8 encoded TSV table with the transcripts of recordings from the archive. The table should have two columns without headers:
      • Name of the audio file with the speaker's text.
      • Line with a verbatim transcript of the recording.
  6. Under Use of voice:

    1. Read and confirm having read the SpeechKit Terms of Use.
    2. Confirm that you have written consent from the person whose voice is used in the audio to use this audio for creating and using speech synthesis.
  7. Click Create voice to start creating a voice.

If using a template by the SpeechKit team, add the audio file name to the beginning of each line and delete the table header line.

Prepared data example

The table header is given only as an example; exclude it from the uploaded file.

| recordings | text |
| --- | --- |
| 1.wav | Books are a uniquely portable magic. |
| 2.wav | We have an amazing book for you! |
| 3.wav | This book is suitable for children from the age of five. |
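A dataset archive of this shape could be assembled with a short script. The sketch below rests on assumptions the text does not confirm: the TSV file name (`transcripts.tsv`) and the placement of all files at the archive root are illustrative choices, and `build_dataset_zip` is a hypothetical helper.

```python
import zipfile
from pathlib import Path
from typing import Iterable, Tuple


def build_dataset_zip(
    pairs: Iterable[Tuple[str, str]],
    zip_path: str,
    tsv_name: str = "transcripts.tsv",  # assumed name, not from the documentation
) -> None:
    """Pack WAV files and a header-less two-column UTF-8 TSV into one ZIP.

    pairs: (path to WAV file, verbatim transcript) for every recording.
    """
    pairs = list(pairs)
    rows = []
    for wav_path, transcript in pairs:
        if "\t" in transcript or "\n" in transcript:
            raise ValueError(f"transcript for {wav_path} contains a tab or newline")
        # Column 1: audio file name; column 2: transcript. No header row.
        rows.append(f"{Path(wav_path).name}\t{transcript}")
    tsv_text = "\n".join(rows) + "\n"

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for wav_path, _ in pairs:
            archive.write(wav_path, arcname=Path(wav_path).name)
        archive.writestr(tsv_name, tsv_text.encode("utf-8"))


if __name__ == "__main__":
    # Assumes 1.wav and 2.wav exist in the current directory.
    build_dataset_zip(
        [
            ("1.wav", "Books are a uniquely portable magic."),
            ("2.wav", "We have an amazing book for you!"),
        ],
        "dataset.zip",
    )
```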

Voice statuses

As soon as you start voice creation, the voice will appear in the list of available voices under Brand Voice Lite. The creation process will take several days. During this period, the voice has the Creating status. Once the process is complete, the voice will become available for testing and its status will switch to Trial. You will have seven days to use the voice for free in SpeechKit Playground and synthesize speech via the API, only paying for the requests (see pricing). Once the trial period expires, the voice will be archived, its status will switch to Archived, and you will no longer be able to use it. Hosting archived voices is free of charge.

If you like the voice you created, you can activate it without waiting for the trial period to end. Also, you can activate an archived voice:

  1. In the management console, select the folder you are going to use to work with SpeechKit.
  2. In the list of services, select SpeechKit.
  3. In the left-hand panel, click Brand Voice Lite.
  4. Select a voice and navigate to its page.
  5. In the top-right corner, click Activate.
    Once activated, the voice will switch its status to Active and become available via the API and in SpeechKit Playground without limitations. The hosting will become chargeable.

If you no longer need the voice, you can archive it.
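The status changes described in this section can be summarized as a small state model. The sketch below is purely a client-side illustration and does not correspond to any SpeechKit API; in particular, allowing both Trial and Active voices to be archived manually is an assumption based on the last sentence above.

```python
from enum import Enum


class VoiceStatus(Enum):
    CREATING = "Creating"
    TRIAL = "Trial"
    ACTIVE = "Active"
    ARCHIVED = "Archived"


# Transitions as read from the text; manual archiving is an assumption.
ALLOWED_TRANSITIONS = {
    VoiceStatus.CREATING: {VoiceStatus.TRIAL},                     # training finished
    VoiceStatus.TRIAL: {VoiceStatus.ACTIVE, VoiceStatus.ARCHIVED}, # activated or expired
    VoiceStatus.ACTIVE: {VoiceStatus.ARCHIVED},                    # archived by the user
    VoiceStatus.ARCHIVED: {VoiceStatus.ACTIVE},                    # reactivated
}


def can_transition(current: VoiceStatus, target: VoiceStatus) -> bool:
    """Return True if the text describes a direct change from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_usable(status: VoiceStatus) -> bool:
    """A voice can be used for synthesis only in the Trial or Active status."""
    return status in (VoiceStatus.TRIAL, VoiceStatus.ACTIVE)


if __name__ == "__main__":
    print(can_transition(VoiceStatus.ARCHIVED, VoiceStatus.ACTIVE))
    print(is_usable(VoiceStatus.CREATING))
```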

How to access a voice in a fine-tuned model

The voice you created will be available in SpeechKit Playground and via API v3. To use the voice via the API, specify the obtained model ID in synthesis settings:

API
{
...
"model": "tts://<folder_ID>/bvss-v1/latest@<voice_ID>/?<model_ID>"
...
}
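If you build requests in code, the `model` string can be assembled from its three IDs. The helper below is a hypothetical convenience function that only fills in the template shown above; it does not validate the IDs against your cloud.

```python
def brand_voice_model_uri(folder_id: str, voice_id: str, model_id: str) -> str:
    """Fill in the tts:// model template with the given IDs."""
    for name, value in (("folder_id", folder_id), ("voice_id", voice_id), ("model_id", model_id)):
        if not value:
            raise ValueError(f"{name} must not be empty")
    return f"tts://{folder_id}/bvss-v1/latest@{voice_id}/?{model_id}"


if __name__ == "__main__":
    # Placeholder values; substitute your own IDs.
    print(brand_voice_model_uri("<folder_ID>", "<voice_ID>", "<model_ID>"))
```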

Example

Use the IAM token to authenticate as a Yandex account or federated account. The account must have the ai.speechkit-tts.user role. For other authentication methods, see Authentication with the SpeechKit API.

cURL

To reproduce this example, you will need the jq utility to work with JSON files.

  1. Create the tts_rest.json file with the following request parameters:

    {
      "text": "Hi! I'm Yandex Speech+Kit. I can turn any text into speech. Now y+ou can, too!",
      "model": "tts://<folder_ID>/bvss-v1/latest@<voice_ID>/?<model_ID>" 
    }
    

    Where:

    • text: Text to synthesize.
    • model: Fine-tuned model you are accessing.
  2. Run the request in the terminal by specifying the IAM token and the ID of the folder you will use to work with SpeechKit:

    export FOLDER_ID=<folder_ID>
    export IAM_TOKEN=<IAM_token>
    # Send the request, extract each Base64 audio chunk, decode it, and append it to audio_my.wav.
    curl \
      --header "Authorization: Bearer $IAM_TOKEN" \
      --header "x-folder-id: $FOLDER_ID" \
      --data @tts_rest.json https://tts.api.cloud.yandex.net:443/tts/v3/utteranceSynthesis | \
      jq -r '.result.audioChunk.data' | \
      while read -r chunk; do base64 -d <<< "$chunk" >> audio_my.wav; done
    

    Where:

    • FOLDER_ID: ID of the folder for which your account has the ai.speechkit-tts.user role or higher. If using a service account, you do not need to include the folder ID in the request.
    • IAM_TOKEN: IAM token of your Yandex account or federated account.

    The response contains the synthesized speech in Base64 encoding. The command decodes it and saves the audio to a file named audio_my.wav.
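The same request can be sketched in Python using only the standard library. The endpoint, headers, and the `.result.audioChunk.data` path come from the cURL example above; the function names are hypothetical, and the code assumes the response body is a sequence of JSON objects, each carrying one Base64-encoded audio chunk, which is what the `jq` filter implies. Error responses are not handled: an object without a `result` field raises `KeyError`.

```python
import base64
import json
import urllib.request
from typing import Iterator

URL = "https://tts.api.cloud.yandex.net:443/tts/v3/utteranceSynthesis"


def iter_json_objects(payload: str) -> Iterator[dict]:
    """Yield each JSON value from a string of concatenated JSON values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(payload):
        if payload[pos].isspace():
            pos += 1  # skip whitespace and newlines between values
            continue
        obj, pos = decoder.raw_decode(payload, pos)
        yield obj


def decode_audio(payload: str) -> bytes:
    """Concatenate the decoded audio chunks found in the response body."""
    audio = bytearray()
    for obj in iter_json_objects(payload):
        audio += base64.b64decode(obj["result"]["audioChunk"]["data"])
    return bytes(audio)


def synthesize(text: str, model: str, iam_token: str, folder_id: str) -> bytes:
    """Send one synthesis request and return the decoded audio bytes."""
    body = json.dumps({"text": text, "model": model}).encode("utf-8")
    request = urllib.request.Request(
        URL,
        data=body,  # supplying data makes this a POST request, like curl --data
        headers={
            "Authorization": f"Bearer {iam_token}",
            "x-folder-id": folder_id,
        },
    )
    with urllib.request.urlopen(request) as response:
        return decode_audio(response.read().decode("utf-8"))
```

To save the result, write the returned bytes to a file opened in binary mode, for example `open("audio_my.wav", "wb").write(audio)`. Unlike the shell loop, which appends to an existing file, this overwrites the file on every run.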
