Preparing and uploading data for SpeechKit Brand Voice Lite

Updated at June 20, 2025

What data is required for training
- Example of prepared data
Prepare data
- Texts for model training
Record audio files
- General recommendations for audio recording
- Requirements for audio recordings
Upload your data

Create your own unique voice with SpeechKit Brand Voice Lite by uploading a small set of annotated audio samples. As a result, you will get the URI of a fine-tuned model that your applications can access via the API.

What data is required for training

To create your own SpeechKit Brand Voice Lite voice, you need to prepare and upload the training data: audio recordings and the texts they are based on. The amount of data needed to train the model depends on how you intend to use the voice and the speaker's manner of speech. For example, a voice used to narrate literary fiction would require more data than a voice for a virtual assistant.

You need to provide at least 20 minutes of training data. The recommended amount is 40 minutes, and one hour or more gives the best results. To train a model with several roles, create a separate dataset for each role and specify the role name in its parameters.
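
The duration thresholds above can be checked before uploading. The following is a minimal sketch, assuming all recordings are WAV files in one local folder; the function names and the status labels are illustrative and not part of SpeechKit.

```python
import wave
from pathlib import Path

MIN_MINUTES = 20          # minimum amount stated in this article
RECOMMENDED_MINUTES = 40  # recommended amount stated in this article


def wav_duration_seconds(path):
    """Return the duration of one WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(folder):
    """Sum the durations of all .wav files located directly in `folder`."""
    return sum(wav_duration_seconds(p) for p in sorted(Path(folder).glob("*.wav")))


def dataset_status(total_seconds):
    """Classify a total duration against the thresholds above (labels are illustrative)."""
    minutes = total_seconds / 60
    if minutes < MIN_MINUTES:
        return "too short"
    if minutes < RECOMMENDED_MINUTES:
        return "minimum met"
    return "recommended met"
```

For example, `dataset_status(total_duration_seconds("recordings"))` would report whether a hypothetical `recordings` folder holds enough audio.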

Upload the training data as a single ZIP archive containing:

Audio recordings in WAV format.
UTF-8 encoded TSV table with the transcripts of recordings from the archive. The table should have two columns without headers:
- Name of the audio file with the speaker's text.
- Line with a verbatim transcript of the recording.

Warning

Each recording should contain a full phrase of one or several sentences. A recording cannot be longer than 15 seconds. The transcripts in the table must exactly match the text in the audio.

Each audio recording should start and end with silence, and any clipped sounds or words are not acceptable. You must not use a podcast recording cut into 15-second segments; doing so will split words and phrases, and the recordings will not match phrase boundaries. It is not possible to train a quality model on such data.

A correct audio recording looks as follows: there is a short interval of silence (100–200 ms, as listed in the requirements for audio recordings) at the beginning and end of the audio, and the phrase is complete.

good-audio

A poor audio recording would be as follows: there are no silence intervals at the beginning and end of the audio, and the start and end of the phrase are cut off.

bad-audio

Such audio recordings will have distortions and are not suitable for model training.
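
One way to spot recordings without silence at the edges is to measure how many near-silent samples precede and follow the speech. This is a rough sketch, not a SpeechKit tool: the amplitude threshold of 500 is an assumed value that you would tune for your microphone, and the helper names are hypothetical.

```python
import array
import sys
import wave

SILENCE_THRESHOLD = 500  # assumed cut-off (out of 32767) below which a sample counts as silence


def edge_silence_ms(samples, sample_rate, threshold=SILENCE_THRESHOLD):
    """Return (leading_ms, trailing_ms) of near-silent audio at both ends."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:  # the whole recording is silent
        total = len(samples) * 1000 / sample_rate
        return total, total
    leading = loud[0]
    trailing = len(samples) - 1 - loud[-1]
    return leading * 1000 / sample_rate, trailing * 1000 / sample_rate


def read_mono_16bit(path):
    """Read a mono 16-bit PCM WAV file; return (samples, sample_rate)."""
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()
    return samples, rate
```

A recording whose leading or trailing value is close to zero is likely cut off and worth re-recording rather than padding with artificial silence.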

Example of prepared data

Before uploading data to create a voice, place all audio files (1.wav, 2.wav, 3.wav) and the TSV table matching the audio files with their transcripts into a ZIP archive.

The table header below is shown for illustration only; exclude it from the uploaded file.

recordings	text
1.wav	Books are a uniquely portable magic.
2.wav	We have an amazing book for you!
3.wav	This book is suitable for children from the age of five.
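
The archive described above can be assembled with a short script. This is a sketch under assumptions: the article does not name the TSV file, so `transcripts.tsv` is a placeholder, and `build_archive` is a hypothetical helper rather than a SpeechKit function.

```python
import zipfile
from pathlib import Path


def build_archive(audio_dir, transcripts, archive_path, table_name="transcripts.tsv"):
    """Pack WAV files and a header-less UTF-8 TSV table into one ZIP archive.

    `transcripts` maps audio file names (e.g. "1.wav") to their verbatim text.
    `table_name` is an assumed file name; the article does not prescribe one.
    """
    audio_dir = Path(audio_dir)
    rows = []
    for name, text in transcripts.items():
        if not (audio_dir / name).is_file():
            raise FileNotFoundError(f"audio file not found: {name}")
        if "\t" in text or "\n" in text:
            raise ValueError(f"transcript for {name} contains a tab or line break")
        rows.append(f"{name}\t{text}")

    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in transcripts:
            archive.write(audio_dir / name, arcname=name)
        # Two columns, no header row, UTF-8 encoded.
        archive.writestr(table_name, ("\n".join(rows) + "\n").encode("utf-8"))
    return archive_path
```

Writing the rows by hand instead of through a CSV library keeps quotation marks in transcripts untouched, at the cost of rejecting texts that contain tabs or line breaks.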

Prepare data

Texts for model training

At least 30% of the training data should consist of questions so that the trained voice can reproduce interrogative intonation.
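
The share of questions can be estimated from the transcripts alone. The sketch below makes two assumptions that the article does not state: the share is counted by number of phrases rather than by audio duration, and a phrase counts as a question if it ends with a question mark.

```python
MIN_QUESTION_SHARE = 0.30  # threshold stated in this article


def question_share(texts):
    """Return the fraction of transcripts that end with a question mark."""
    if not texts:
        return 0.0
    questions = sum(1 for text in texts if text.rstrip().endswith("?"))
    return questions / len(texts)


def has_enough_questions(texts):
    """Check the share of questions against the 30% threshold."""
    return question_share(texts) >= MIN_QUESTION_SHARE
```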

To train your model, you can use texts prepared by the SpeechKit team or create them yourself. If you are using the SpeechKit example, add the names of the corresponding audio files to the beginning of each row and then delete the row with the table header.

Requirements for all texts

Tip

The result of SpeechKit Brand Voice Lite voice model training directly depends on training data. Make sure to meet the specified text requirements and recommendations to get a high-quality SpeechKit Brand Voice Lite voice.

Each audio recording must have an absolutely accurate text transcript.
The recommended phrase length is no more than 250 characters.
Audio recordings should not include incomplete phrases.

Example of an unacceptable incomplete phrase, cut off at both ends: ing time! He won't be happy about that. If you hadn't had that argument, you could have asked for anything you wa.
Texts must be free of grammatical errors.

When recording Russian speech, make sure to use the letter ё when needed in the pattern text. You cannot use е instead of ё in any synthesized text.
In homographs where the stress may be ambiguous, explicitly mark the stress by placing the + sign before the stressed vowel.

- The +import of goods from foreign countries. Here the stress falls on the first syllable.
- Goods they imp+ort from China. Here the stress falls on the second syllable.
The lines of text for synthesis cannot include numbers or abbreviations, such as 15 Smith st., bl. 3 or USD 13.10. Spell out all numbers and abbreviations:

- Fifteen Smith street, block three.
- Thirteen dollars and ten cents.
In questions, specify which word should be **logically stressed** to indicate the intonation of an interrogative sentence.
The sentence Did the cat go to the forest? can be read in three different ways:
- Did the **cat** go to the forest? Meaning Who went to the forest? Was it really the cat?
- Did the cat **go** to the forest? Meaning Did the cat walk or run? or Was the action performed? Is the cat gone or not?
- Did the cat go **to the forest**? Meaning Where did the cat go and why? To the forest, outside, to look for a sausage?
In all sentences, the logical stress emphasizes the main meaning of the sentence.
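
Some of the text requirements above lend themselves to automatic checks. The following sketch covers only the mechanical ones (empty text, the 250-character recommendation, and digits); grammar, abbreviations, stress marks, and the letter ё still need a human reviewer. The function name and messages are illustrative.

```python
import re

MAX_CHARS = 250  # recommended phrase length from this article


def check_transcript(text):
    """Return a list of problems found in one transcript (empty if none)."""
    issues = []
    if not text.strip():
        issues.append("empty transcript")
    if len(text) > MAX_CHARS:
        issues.append("longer than 250 characters")
    if re.search(r"\d", text):
        issues.append("contains digits; spell numbers out")
    return issues
```

For instance, `check_transcript("15 Smith st., bl. 3")` flags the digits, while the spelled-out version passes.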

Record audio files

General recommendations for audio recording

Noise and echo during recording directly affect the quality of training and the performance of your speech synthesis model. For this reason, when recording audio for training or fine-tuning your model, or patterns for synthesis, try to reduce sound reflection in the room. The ideal place for recording is a room equipped with acoustic panels. If you record at home, you can reduce echo with soft surfaces and furniture, such as carpets and upholstered sofas.

Use the same equipment, with the same settings and in the same position, for every recording.

Make sure the sound engineer, speaker, and voice coach are different people. A voice coach is a professional who makes sure the speaker strictly follows the text without changing the manner of narration.

When recording any audio intended for your speech synthesis model, follow the one-phrase-per-file rule.

No inaccuracies, micro-repeats, slips of the tongue, or word substitutions are allowed. Any speech defects in the recordings made for synthesis models substantially degrade the quality of the synthesized speech.

Each recording of a pattern or phrase for model training must have an absolutely accurate text transcript. Patterns, training phrases, and their transcripts must not contain grammatical errors.

Requirements for audio recordings

Requirement	Value
Sampling frequency	48 kHz
Audio bit depth	16 bit PCM
Number of channels	1 (mono)
Format	WAV
Duration	max. 15 seconds
Silence intervals at the beginning and end	100–200 ms
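
The header-level requirements in the table can be verified with Python's standard `wave` module. This is a minimal sketch with a hypothetical function name; it does not check the silence intervals, and `wave.open` raises `wave.Error` for files that are not plain PCM WAV, which you may want to treat as a failed check.

```python
import wave

REQUIRED_RATE = 48000     # 48 kHz
REQUIRED_SAMPLE_WIDTH = 2  # 16 bit = 2 bytes per sample
REQUIRED_CHANNELS = 1     # mono
MAX_SECONDS = 15


def check_wav_format(path):
    """Return a list of requirement violations for one WAV file (empty if none)."""
    issues = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        if rate != REQUIRED_RATE:
            issues.append(f"sampling frequency is {rate} Hz, expected 48000 Hz")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH:
            issues.append(f"bit depth is {wav.getsampwidth() * 8} bit, expected 16 bit")
        if wav.getnchannels() != REQUIRED_CHANNELS:
            issues.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
        duration = wav.getnframes() / rate
        if duration > MAX_SECONDS:
            issues.append(f"duration is {duration:.1f} s, maximum is 15 s")
    return issues
```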

Pack all the audio recordings into a ZIP archive together with the TSV table of transcripts so that you can upload them as one file.

Upload your data

Use the management console interface to upload the archive with text data and audio files.
