© 2025 Direct Cursus Technology L.L.C.

Pattern-based speech synthesis in SpeechKit Brand Voice Call Center

Written by
Yandex Cloud
Updated at February 10, 2025

Warning

Access to the SpeechKit Brand Voice Call Center service is available on request. To access the technology, fill out this form.

SpeechKit Brand Voice Call Center allows you to synthesize and record natural-sounding voice pattern phrases (templates) that call centers often use for customer calls.

SpeechKit Brand Voice Call Center synthesizes audio with a sampling rate of 8 kHz. The synthesized speech file itself may have a higher sampling rate. This is similar to recording a telephone conversation with high-quality equipment: even though you record at a high sampling rate, the telephone channel still limits the audio signal to 8 kHz.

In SpeechKit Brand Voice Call Center, you can adjust the sampling rate of the output audio files using various audio formats:

  • OggOpus, MP3, or WAV container formats: The synthesized speech file is processed using standard SpeechKit specifications and audio compression algorithms.

    Use the container_audio parameter for these audio formats when submitting your request to API v3.

  • RAW format: No compression algorithms are used, and you can specify a custom sampling rate.

    For raw audio, use the raw_audio parameter when submitting your request to API v3.

Below are some examples of how to use API v3 to synthesize speech into a standard or raw audio format. Use these examples to create a custom speech synthesis solution with SpeechKit Brand Voice Call Center.
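
To make the sampling rate discussion concrete, here is a minimal sketch of the arithmetic behind raw LPCM audio. It assumes 16-bit mono audio at 8 kHz, as in the raw format example in this article, and uses the 24-second maximum phrase length from the template restrictions. The function names lpcm_size_bytes and lpcm_duration_ms are hypothetical helpers, not part of the SpeechKit API or pydub.

```python
def lpcm_size_bytes(duration_ms, frame_rate=8000, sample_width=2, channels=1):
    """Size of uncompressed LPCM audio in bytes.

    sample_width is in bytes per sample (2 means 16 bits).
    """
    frames = duration_ms * frame_rate // 1000
    return frames * sample_width * channels


def lpcm_duration_ms(num_bytes, frame_rate=8000, sample_width=2, channels=1):
    """Duration in milliseconds of an uncompressed LPCM buffer."""
    frames = num_bytes // (sample_width * channels)
    return frames * 1000 // frame_rate


# A 24-second phrase at 8 kHz, 16 bits, mono.
print(lpcm_size_bytes(24_000))                     # 384000
# The same phrase stored at 48 kHz takes six times the space.
print(lpcm_size_bytes(24_000, frame_rate=48_000))  # 2304000
# Going back from a byte count to a duration.
print(lpcm_duration_ms(384_000))                   # 24000
```

A helper like lpcm_duration_ms can be handy for checking that a raw response has the length you expect before you convert it.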

To synthesize template phrases for a call center:

  1. Prepare an audio recording template.
  2. Send data to the API.

Getting started

  1. Clone the Yandex Cloud API repository:

    git clone https://github.com/yandex-cloud/cloudapi
    
  2. Create a service account to work with the SpeechKit API.

  3. Assign the ai.speechkit-tts.user role, or higher, to the service account, which will allow it to work with SpeechKit in the folder it was created in.

  4. Get an IAM token for the service account.

Prepare an audio template

Template parameter restrictions:

  • A phrase for synthesis cannot be longer than 24 seconds (the API limit), including the variable part. For comfortable listening, keep the phrase under 16 seconds, as in a regular conversation.
  • Template length: not more than 250 characters of normalized text.
  • The normalized text of the variable part must not exceed 25% of the template length. The same restriction applies to the duration of the variable part relative to the duration of the final audio.
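
The restrictions above can be checked before sending a request. The sketch below is an illustration only: check_template_limits is a hypothetical helper, it uses the plain character count as a stand-in for the normalized text length (the actual normalization is done by the service), and it assumes the template length is measured on the phrase with the variables filled in.

```python
MAX_PHRASE_MS = 24_000       # API limit on phrase duration
MAX_TEMPLATE_CHARS = 250     # limit on template length
MAX_VARIABLE_SHARE = 0.25    # variable part relative to the whole template


def check_template_limits(template, variables, phrase_ms=None):
    """Return a list of human-readable problems; an empty list means OK.

    variables maps a placeholder such as '{date}' to its text value.
    phrase_ms is the phrase duration in milliseconds, if known.
    """
    problems = []

    # Fill in the variables to get the phrase as it will be spoken.
    filled = template
    for name, value in variables.items():
        filled = filled.replace(name, value)

    if len(filled) > MAX_TEMPLATE_CHARS:
        problems.append(
            f"template is {len(filled)} characters, limit is {MAX_TEMPLATE_CHARS}"
        )

    # Count each variable value once per occurrence in the template.
    variable_chars = sum(
        template.count(name) * len(value) for name, value in variables.items()
    )
    if variable_chars > MAX_VARIABLE_SHARE * len(filled):
        problems.append(
            f"variable part is {variable_chars} of {len(filled)} characters, "
            f"limit is {MAX_VARIABLE_SHARE:.0%}"
        )

    if phrase_ms is not None and phrase_ms > MAX_PHRASE_MS:
        problems.append(f"phrase is {phrase_ms} ms, limit is {MAX_PHRASE_MS} ms")

    return problems


template = (
    "This is to remind you that a waste collection truck "
    "will arrive at your place on {date} at {time}."
)
variables = {"{date}": "tomorrow", "{time}": "fourteen hours"}
print(check_template_limits(template, variables, phrase_ms=7_000))  # []
```

Because the service normalizes text on its side (for example, expanding numbers into words), treat a passing result as a rough indication rather than a guarantee.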

To prepare a template:

  1. Make an audio recording of your pattern phrase.

  2. Mark up the text of the pattern phrase according to the requirements and enclose the variable parts in {} (curly brackets).

    Marked-up text example: This is to remind you that a waste collection truck will arrive at your place on {date} at {time}.
    List of variables: variable_name = '{date}', variable_value = 'tomorrow', variable_name = '{time}', variable_value = 'fourteen hours'.

  3. For each variable part, set its start time and duration in the audio recording (in milliseconds).
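
The start times and durations from step 3 are easy to get wrong when measured by hand. The following sketch checks them for basic consistency: each variable part must lie inside the recording, and variable parts must not overlap. VariableSpan and check_spans are hypothetical names, and the timing values in the usage example are made up for illustration.

```python
from typing import NamedTuple


class VariableSpan(NamedTuple):
    name: str        # placeholder, e.g. '{date}'
    start_ms: int    # start of the variable part in the recording
    length_ms: int   # duration of the variable part


def check_spans(audio_ms, spans):
    """Return a list of problems found in the variable markup."""
    problems = []
    previous_end = 0
    for span in sorted(spans, key=lambda s: s.start_ms):
        end = span.start_ms + span.length_ms
        if span.start_ms < 0 or span.length_ms <= 0:
            problems.append(f"{span.name}: start must be >= 0 and length > 0")
        if end > audio_ms:
            problems.append(
                f"{span.name}: ends at {end} ms, recording is {audio_ms} ms long"
            )
        if span.start_ms < previous_end:
            problems.append(f"{span.name}: overlaps the previous variable part")
        previous_end = max(previous_end, end)
    return problems


# Hypothetical markup for a 6-second recording.
spans = [
    VariableSpan("{date}", start_ms=4200, length_ms=600),
    VariableSpan("{time}", start_ms=5000, length_ms=900),
]
print(check_spans(6000, spans))  # []
```

The start_ms and length_ms fields correspond to the variable_start_ms and variable_length_ms values passed in the request code in this article.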

Send data to the API

Create and run a client app to send your data to the API:

Python 3
  1. Install dependencies using the PIP package manager:

    pip install grpcio-tools && \
    pip install pydub
    

    You need the grpcio-tools package to generate the interface code for the API v3 synthesis client.

    You need the pydub package to process the resulting audio files.

  2. Download the FFmpeg utility for correct operation of the pydub package.

  3. Add the path to the folder with the FFmpeg executable to the PATH variable:

    export PATH=$PATH:<path_to_folder_with_FFmpeg_executable>
    
  4. Go to the folder with the Yandex Cloud API repository, create a folder named output and generate the client interface code in it:

    cd <path_to_cloudapi_folder> && \
    mkdir output && \
    python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
      --python_out=output \
      --grpc_python_out=output \
      google/api/http.proto \
      google/api/annotations.proto \
      yandex/cloud/api/operation.proto \
      google/rpc/status.proto \
      yandex/cloud/ai/tts/v3/tts_service.proto \
      yandex/cloud/ai/tts/v3/tts.proto
    

    The tts_pb2.py, tts_pb2_grpc.py, tts_service_pb2.py, and tts_service_pb2_grpc.py client interface files, as well as dependency files, will be created in the output folder.

  5. Create a file named test.py in the root of the output folder and add the code for API v3 pattern-based synthesis to it.

    Use one of the code samples below, depending on whether you want a container format or the raw format:

    Container formats: OggOpus, MP3, or WAV
    import io
    import grpc
    import pydub
    import argparse
    
    import yandex.cloud.ai.tts.v3.tts_pb2 as tts_pb2
    import yandex.cloud.ai.tts.v3.tts_service_pb2_grpc as tts_service_pb2_grpc
    
    def synthesize(iam_token, bytes_array) -> pydub.AudioSegment:
        template = "<template_phrase_with_markup>"
    
        # Template example: 'This is to remind you that your child has an appointment for {procedure} tomorrow at {time}.'
        request = tts_pb2.UtteranceSynthesisRequest(
            output_audio_spec=tts_pb2.AudioFormatOptions(
                container_audio=tts_pb2.ContainerAudio(
                    # This example uses the WAV container. To use another format, specify OGG_OPUS or MP3 instead of WAV.
                    container_audio_type=tts_pb2.ContainerAudio.WAV
                )
            ),
            loudness_normalization_type=tts_pb2.UtteranceSynthesisRequest.LUFS,
            text_template = tts_pb2.TextTemplate(
                text_template = template,
                variables = [
                    # List template variables.
                    # The number of tts_pb2.TextVariable() list items depends on the number of template variables.
                    tts_pb2.TextVariable(
                        variable_name = "<variable_name_in_template>",
                        # Example: '{time}'.
                        variable_value = "<text_for_synthesis>"
                        # Example: 'eight hours and thirty minutes'.
                    )
                ]
            ),
            hints = [
                tts_pb2.Hints(
                    audio_template = tts_pb2.AudioTemplate(
                        audio = tts_pb2.AudioContent(
                            # Upload a source audio for your template.
                            content = bytes_array,
                            audio_spec = tts_pb2.AudioFormatOptions(
                                container_audio = tts_pb2.ContainerAudio(
                                    container_audio_type = tts_pb2.ContainerAudio.WAV
                                )
                            )
                        ),
                        text_template = tts_pb2.TextTemplate(
                            text_template = template,
                            variables = [
                                # List template variables.
                                # The number of tts_pb2.TextVariable() list items depends on the number of template variables.
                                tts_pb2.TextVariable(
                                    variable_name = "<variable_name_in_template>",
                                    variable_value = "<text_of_phrase_variable_in_template_audio_file>"
                                )
                            ]
                        ),
                        variables = [
                            # List the audio parameters of template variables.
                            # The number of tts_pb2.AudioVariable() list items depends on the number of template variables.
                            tts_pb2.AudioVariable(
                                variable_name = "<variable_name_in_template>",
                                # Length of the variable part of the phrase in the template audio (ms).
                                variable_length_ms = <variable_length>,
                                # Start of the variable part of the phrase in the template audio (ms).
                                variable_start_ms = <variable_start_time>
                            )
                        ]
                    )
                )
            ],
            # Special SpeechKit Brand Voice Call Center model, available on request. 
            model = "zsl"
        )
    
        # Establish a connection with the server.
        cred = grpc.ssl_channel_credentials()
        channel = grpc.secure_channel('tts.api.cloud.yandex.net:443', cred)
        stub = tts_service_pb2_grpc.SynthesizerStub(channel)
    
        # Send data for synthesis.
        it = stub.UtteranceSynthesis(
            request,
            metadata=(
                # You can use an API key for tests.
                # ('authorization', f'Api-Key {api_key}'),
                # For business solutions, it is better to use an IAM token.
                ('authorization', f'Bearer {iam_token}'),
            ),
        )
    
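        # Process the server responses and return the synthesized audio.
        try:
            audio = io.BytesIO()
            for response in it:
                audio.write(response.audio_chunk.data)
            audio.seek(0)
            # from_wav() matches the WAV output requested above.
            # For OGG_OPUS or MP3 output, use the matching pydub loader, e.g., from_ogg() or from_mp3().
            return pydub.AudioSegment.from_wav(audio)
        # grpc.RpcError is the public exception type; do not rely on private grpc._channel classes.
        except grpc.RpcError as err:
            print(f'Error code {err.code()}, message: {err.details()}')
            raise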
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--token', required=True, help='IAM token')
        parser.add_argument('--input', required=True, help='Template WAV file')
        parser.add_argument('--output', required=True, help='Output WAV file with synthesized speech')
        args = parser.parse_args()
    
        with open(args.input, 'rb') as file:
            speech = file.read()
    
        audio = synthesize(args.token, speech)
        with open(args.output, 'wb') as fp:
            audio.export(fp, format='wav')
    
    RAW format
    import io
    import grpc
    import pydub
    import argparse
    
    import yandex.cloud.ai.tts.v3.tts_pb2 as tts_pb2
    import yandex.cloud.ai.tts.v3.tts_service_pb2_grpc as tts_service_pb2_grpc
    
    def synthesize(iam_token, bytes_array) -> pydub.AudioSegment:
        template = "<template_phrase_with_markup>"
        # Template example: 'This is to remind you that your child has an appointment for {procedure} tomorrow at {time}.'
        request = tts_pb2.UtteranceSynthesisRequest(
            output_audio_spec=tts_pb2.AudioFormatOptions(
                raw_audio=tts_pb2.RawAudio(
                    audio_encoding=tts_pb2.RawAudio.LINEAR16_PCM,
                    sample_rate_hertz=8000
                )
            ),
            loudness_normalization_type=tts_pb2.UtteranceSynthesisRequest.LUFS,
            text_template = tts_pb2.TextTemplate(
                text_template = template,
                variables = [
                    # List template variables.
                    # The number of tts_pb2.TextVariable() list items depends on the number of template variables.
                    tts_pb2.TextVariable(
                        variable_name = "<variable_name_in_template>",
                        # Example: '{time}'.
                        variable_value = "<text_for_synthesis>"
                        # Example: 'eight hours and thirty minutes'.
                    )
                ]
            ),
            hints = [
                tts_pb2.Hints(
                    audio_template = tts_pb2.AudioTemplate(
                        audio = tts_pb2.AudioContent(
                            # Upload a source audio for your template.
                            content = bytes_array,
                            audio_spec = tts_pb2.AudioFormatOptions(
                                container_audio = tts_pb2.ContainerAudio(
                                    container_audio_type = tts_pb2.ContainerAudio.WAV
                                )
                            )
                        ),
                        text_template = tts_pb2.TextTemplate(
                            text_template = template,
                            variables = [
                                # List template variables.
                                # The number of tts_pb2.TextVariable() list items depends on the number of template variables.
                                tts_pb2.TextVariable(
                                    variable_name = "<variable_name_in_template>",
                                    variable_value = "<text_of_phrase_variable_in_template_audio_file>"
                                )
                            ]
                        ),
                        variables = [
                            # List the audio parameters of template variables.
                            # The number of tts_pb2.AudioVariable() list items depends on the number of template variables.
                            tts_pb2.AudioVariable(
                                variable_name = "<variable_name_in_template>",
                                # Length of the variable part of the phrase in the template audio (ms).
                                variable_length_ms = <variable_length>,
                                # Start of the variable part of the phrase in the template audio (ms).
                                variable_start_ms = <variable_start_time>
                            )
                        ]
                    )
                )
            ],
            # Special SpeechKit Brand Voice Call Center model, available on request.
            model = "zsl"
        )
    
        # Establish a connection with the server.
        cred = grpc.ssl_channel_credentials()
        channel = grpc.secure_channel('tts.api.cloud.yandex.net:443', cred)
        stub = tts_service_pb2_grpc.SynthesizerStub(channel)
    
        # Send data for synthesis.
        it = stub.UtteranceSynthesis(
            request,
            metadata=(
                # You can use an API key for tests.
                # ('authorization', f'Api-Key {api_key}'),
                # For business solutions, it is better to use an IAM token.
                ('authorization', f'Bearer {iam_token}'),
            ),
        )
    
        # Process the server responses and return the synthesized audio.
        try:
            audio = io.BytesIO()
            for response in it:
                audio.write(response.audio_chunk.data)
            audio.seek(0)
            # The values below must match the RawAudio() settings in the request.
            return pydub.AudioSegment.from_raw(
                audio, sample_width=2, frame_rate=8000, channels=1
            )
        # grpc.RpcError is the public exception type; do not rely on private grpc._channel classes.
        except grpc.RpcError as err:
            print(f'Error code {err.code()}, message: {err.details()}')
            raise
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--token', required=True, help='IAM token')
        parser.add_argument('--input', required=True, help='Template WAV file')
        parser.add_argument('--output', required=True, help='Output WAV file with synthesized speech')
        args = parser.parse_args()
    
        with open(args.input, 'rb') as file:
            speech = file.read()
    
        audio = synthesize(args.token, speech)
        with open(args.output, 'wb') as fp:
            audio.export(fp, format='wav')
    

    To use the raw format with custom settings for the generated audio, set output_audio_spec.raw_audio to a RawAudio() message supported by the SpeechKit API. RawAudio() takes the following parameters:

    • audio_encoding: Audio encoding type. Only the LPCM format is supported.

    • sample_rate_hertz: Sampling rate of generated audio in Hz.

      For SpeechKit Brand Voice Call Center, the maximum sampling rate is 8,000 Hz. Specify this sampling rate to avoid adding unnecessary distortions when converting and compressing your audio file.

    You can set audio conversion settings in the from_raw() function supported by the pydub library. Function parameters:

    • audio: Buffer with the synthesized raw audio received from the API.

    • sample_width: Audio bit depth. The possible values are:

      • 1: 8 bits
      • 2: 16 bits (CD quality)
      • 4: 32 bits

      SpeechKit Brand Voice Call Center supports 16-bit audio depth; therefore, set sample_width to 2.

    • frame_rate: Sampling rate in Hz applied when converting the audio. Set it to 8000 so that the sampling rate of the generated audio matches the 8,000 Hz rate supported by SpeechKit Brand Voice Call Center.

    • channels: Number of audio channels. The possible values are:

      • 1: Mono
      • 2: Stereo

      SpeechKit Brand Voice Call Center supports mono signal; therefore, set channels to 1.

  6. Execute the test.py file:

    export IAM_TOKEN=<service_account_IAM_token> && \
    python3 output/test.py \
      --token ${IAM_TOKEN} \
      --input sample.wav \
      --output speech.wav
    

    Where:

    • IAM_TOKEN: Service account IAM token.
    • --input: Name of the audio file with the template phrase.
    • --output: Name of the audio file with the synthesized phrase.

This will create a file named speech.wav with your synthesized pattern-based phrase in the cloudapi folder.
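
If the script fails to start, the cause is often a missing prerequisite from the earlier steps. The sketch below is a hypothetical preflight check, not part of the SpeechKit tooling: it looks for the FFmpeg executable on PATH, for the four generated client files, and for the template recording. It assumes the executable is named ffmpeg, that you run it from the cloudapi folder, and that the generated files sit under output in the same package path the test.py imports use.

```python
import shutil
from pathlib import Path

# Client interface files generated by grpc_tools.protoc (assumed layout).
GENERATED_FILES = [
    "yandex/cloud/ai/tts/v3/tts_pb2.py",
    "yandex/cloud/ai/tts/v3/tts_pb2_grpc.py",
    "yandex/cloud/ai/tts/v3/tts_service_pb2.py",
    "yandex/cloud/ai/tts/v3/tts_service_pb2_grpc.py",
]


def preflight(output_dir="output", template_wav="sample.wav"):
    """Return a list of missing prerequisites; an empty list means all found."""
    problems = []

    # pydub relies on the FFmpeg executable being discoverable via PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg executable not found on PATH")

    root = Path(output_dir)
    for relative in GENERATED_FILES:
        if not (root / relative).is_file():
            problems.append(f"missing generated file: {root / relative}")

    if not Path(template_wav).is_file():
        problems.append(f"template recording not found: {template_wav}")

    return problems


for problem in preflight():
    print(problem)
```

An empty result only means the files are in place; it does not verify the IAM token or access to the zsl model.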

See also

  • Speech synthesis in the API v3
  • Yandex SpeechKit Brand Voice
  • Pattern-based speech synthesis
