Creating a voice agent

Written by Yandex Cloud
Updated on October 28, 2025
  • Getting started
  • Create a voice agent

In Yandex AI Studio, you can create a voice agent that communicates with the user in both voice and text and maintains a dialogue that closely resembles natural human conversation.

Getting started

To use the example:

  1. Get an API key for the service account you are going to use to interact with the Realtime API. For more information, see Setting up access to Yandex AI Studio with API keys.
  2. Install Python 3.10 or higher.

  3. Install the Python venv module to create isolated virtual environments.

  4. Create a new Python virtual environment and activate it:

    python3 -m venv new-env
    source new-env/bin/activate
    
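  5. Install the third-party websockets Python package the example below uses for the WebSocket connection, and make sure the ffmpeg utility is available in your PATH (the example calls it to convert the input audio):

    pip install websockets
    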

Create a voice agent

The example below illustrates client-server interaction with the Realtime API. To create a voice agent:

  1. Create a file named voice-agent.py and paste the following code into it:

    import asyncio
    import json
    import base64
    import websockets
    import wave
    import subprocess
    
    
    API_KEY = "<API_key>"
    FOLDER = "<folder_ID>"
    INPUT_FILE = "<path_to_input_audio_file>"
    CONVERTED_FILE = "./converted.wav"
    OUTPUT_FILE = "./output.wav"
    CHUNK_SIZE = 15000  # Chunk size, in bytes (~300 ms at 24 kHz mono PCM16)
    
    REALTIME_URL = f"wss://rest-assistant.api.cloud.yandex.net/v1/realtime/openai?model=gpt://{FOLDER}/speech-realtime-250923"
    
    def make_silence_chunk(samples: int):
        """Generates a silence chunk of a specified duration (samples) in PCM16 format."""
        return b"\x00\x00" * samples
    
    
    async def main():
        # 1️⃣ Converting the input file to a format the Realtime API can understand:
        # mono, 24kHz, 16-bit PCM. Without this, the model will not accept the audio.
        print("🎧 Converting the input file...")
        subprocess.run([
            "ffmpeg", "-y", "-i", INPUT_FILE,
            "-ac", "1", "-ar", "24000", "-sample_fmt", "s16",
            CONVERTED_FILE
        ], check=True)
        print(f"✅ Converted {INPUT_FILE} -> {CONVERTED_FILE}")
    
        # 2️⃣ Establishing a WebSocket connection with the Realtime API.
        async with websockets.connect(
                REALTIME_URL,
                additional_headers=[("Authorization", f"api-key {API_KEY}")]
        ) as ws:
            print("✅ Connected to Realtime API")
    
            # 3️⃣ Waiting for the first "session.created" event with the session settings.
            created = json.loads(await ws.recv())
            print("✅ Session created:", created)
    
            # 4️⃣ Updating the session: enabling server_vad, selecting a voice and output format.
            await ws.send(json.dumps({
                "type": "session.update",
                "session": {
                    "type": "realtime",
                    "output_modalities": ["text", "audio"],  # we want both text and sound
                    "audio": {
                        "input": {
                            # Input audio format
                            "format": {"type": "audio/pcm", "rate": 24000},
                            # Server-side VAD configuration
                            "turn_detection": {
                                "type": "server_vad",  # enabling sever VAD
                                "threshold": 0.5,  # sensitivity
                                "silence_duration_ms": 400,  # silence considered to be the end of speech
                            }
                        },
                        "output": {
                            "format": {"type": "audio/pcm", "rate": 44100},
                            "voice": "marina"
                        }
                    },
                    "instructions": """
                If asked about changing seats in an airplane, pretend you have understood and done everything. 
    In your response, strictly follow these rules:
    You are assisting the user in the voice channel. 
    If the user mentions order number, seats, address, or other important info, make sure you go through it verbally in your response for the user to understand that their inputs were heard correctly.
    Keep your responses concise; basically, make sure they are easy to listen to.
                """
                }
            }))
    
            # 5️⃣ Reading the WAV file we prepared and sending it in chunks.
            with wave.open(CONVERTED_FILE, "rb") as wf:
                assert wf.getframerate() == 24000, "Expected 24 kHz"
                audio_bytes = wf.readframes(wf.getnframes())
                samples_per_chunk = CHUNK_SIZE // 2  # as 2 bytes = 1 sample
                print(f"📊 Sending {len(audio_bytes)} bytes ({wf.getnframes()} frames)")
    
            # Sending the audio data in chunks
            for i in range(0, len(audio_bytes), CHUNK_SIZE):
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(audio_bytes[i:i + CHUNK_SIZE]).decode("ascii")
                }))
    
            # Sending ~1.5 s of silence at the end so the server VAD reliably detects the end of speech
            silence_chunk = make_silence_chunk(samples_per_chunk)
            for _ in range(5):  # 5 chunks of ~300 ms each, ~1.5 s in total
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(silence_chunk).decode("ascii")
                }))
    
            committed = False  # we will be monitoring the buffer to make sure it is committed
            full_text, audio_chunks = [], []
    
            try:
                while True:
                    msg = await ws.recv()
                    data = json.loads(msg)
                    print("📥 Event:", data)
    
                    if data.get("type") == "input_audio_buffer.committed" and not committed:
                        committed = True
                        print("✅ Audio buffer committed — creating a response...")
                        await ws.send(json.dumps({
                            "type": "response.create",
                            "response": {
                                "instructions": "Give a brief answer to the question from the audio."
                            }
                        }))
    
                    # Collecting audio deltas if the model sends audio
                    if data.get("type") == "response.output_audio.delta":
                        audio_chunks.append(base64.b64decode(data["delta"]))
    
                    # Collecting text deltas if the model sends text
                    if data.get("type") == "response.output_text.delta":
                        full_text.append(data["delta"])
    
                    # When the response is fully ready → printing the text and saving the audio
                    if data.get("type") == "response.done":
                        print("✅ Response completed")
                        if full_text:
                            print("📝 Final text:", "".join(full_text))
                        if audio_chunks:
                            with wave.open(OUTPUT_FILE, "wb") as wf:
                                wf.setnchannels(1)
                                wf.setsampwidth(2)
                                wf.setframerate(44100)
                                wf.writeframes(b"".join(audio_chunks))
                            print(f"🔊 Audio saved to {OUTPUT_FILE}")
                        break
    
            except websockets.ConnectionClosed:
                print("🔌 Connection closed by server")
    
    
    if __name__ == "__main__":
        asyncio.run(main())
    

    Where:

    • API_KEY: API key of the service account used to call Realtime API methods.
    • FOLDER: ID of the folder the Realtime API is called in.
    • INPUT_FILE: Path to the file containing the input audio message. In this example, you can use any audio file, for example, one generated in SpeechKit Playground.
    • CONVERTED_FILE: Path to the audio file the input audio message will be saved to after conversion to the required format (mono, 24 kHz, 16-bit PCM). A sketch for checking this file's format is shown after this procedure.
    • OUTPUT_FILE: Path to the audio file the model's audio response will be saved to.
  2. Run the file you created:

    python3 voice-agent.py
    

    When you run the code, the model response in text format is displayed on the screen, and the response in audio format is saved to the file whose path is specified in the OUTPUT_FILE variable.
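
To double-check that the converted file matches the format the Realtime API expects (mono, 24 kHz, 16-bit PCM), you can inspect it with a minimal sketch based on Python's standard wave module; the ./converted.wav path below assumes the CONVERTED_FILE value from the example above:

    import wave
    
    # Print the parameters of the converted file: voice-agent.py asserts only
    # the sample rate, while this also checks the channel count and sample width.
    with wave.open("./converted.wav", "rb") as wf:
        channels = wf.getnchannels()       # expected: 1 (mono)
        rate = wf.getframerate()           # expected: 24000 Hz
        width = wf.getsampwidth()          # expected: 2 bytes (16-bit PCM)
        duration = wf.getnframes() / rate  # length, in seconds
    
    print(f"channels={channels}, rate={rate} Hz, sample width={width} bytes, duration={duration:.2f} s")
    

The same approach can be used to inspect ./output.wav after the run (the response is saved at 44.1 kHz); alternatively, you can simply play the saved response back in any audio player.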

See also

Voice agents
