Voice agents
Voice agents in Yandex AI Studio enable you to create AI-powered applications that engage with users in both written and spoken form.
Voice agents support two-way message exchange: the client sends events with audio or text instructions, and the server delivers responses as soon as they are generated. With this streaming mode, you can see partial responses immediately without waiting for full processing, which ensures the dialogue evolves in a natural way.
You can use voice agents for the following scenarios:
- Suggestions for support agents: While a support agent communicates with a client through voice or text, the voice agent analyzes the conversation in real time and suggests a ready-made answer or a link to the relevant guide. This helps the support agent answer questions faster and more accurately.
- Voice agent: Next-level alternative to voice bots. The voice agent can take orders and answer support questions, responding in real time like a human support agent.
- Automatic call summarization: The voice agent analyzes the audio stream in real time and highlights key points, such as topics, arrangements, and further steps. As the conversation winds down, it generates a call summary and a checklist and sends them to the CRM system or chat.
Realtime API
In AI Studio, you can create voice agents using the Realtime API, an event-driven interface for real-time voice interaction between the client and server over a WebSocket connection.
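For orientation, a minimal client in Python (using the third-party `websockets` package) could look like the sketch below. The endpoint URL and the authorization header format are placeholders for illustration, not values from this documentation; take the actual values from the AI Studio reference.

```python
import asyncio
import json

import websockets  # pip install websockets

# Placeholder endpoint and credentials; substitute the real values.
REALTIME_URL = "wss://<realtime_endpoint>"
HEADERS = {"Authorization": "Api-Key <your_API_key>"}

async def main() -> None:
    # The session is created when the WebSocket connection is established.
    async with websockets.connect(
        REALTIME_URL,
        additional_headers=HEADERS,  # named extra_headers in websockets < 14
    ) as ws:
        # Every server message is a separate JSON event with a "type" field.
        async for message in ws:
            event = json.loads(message)
            print("server event:", event["type"])

asyncio.run(main())
```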
Model
To process user requests, the Realtime API runs a multimodal model tailored for interactive voice and mixed voice-and-text scenarios with minimal latency:
| Model and URI | Context | Operating modes |
|---|---|---|
| speech-realtime-250923<br/>`gpt://<folder_ID>/speech-realtime-250923` | 32,000 | Synchronous |
Designed for Russian-language scenarios, the model is well equipped for creating voice assistants, chat environments, and apps requiring a natural, human-like flow of conversation.
Voices
Realtime API is compatible with all standard Yandex SpeechKit voices as well as SpeechKit Brand Voice Lite and SpeechKit Brand Voice Premium voices.
You can listen to voice samples on the SpeechKit page.
Sessions
The client-server communication context is stored in sessions. Each session holds the conversation history and configuration settings, such as the model's system prompt, selected speech synthesis voice, and expected modalities (text or voice).
The system creates a session upon establishing a WebSocket connection and keeps it active until either the connection closes or the session lifetime expires.
The session lifetime is five minutes, but it can be increased to ten minutes if required. You can edit the session configuration settings while the conversation is in progress. For example, you can update the system prompt, speech synthesis voice, or modalities.
To continue working after the current session is closed, create a new session.
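As an illustration of editing a session mid-conversation, a `session.update` event could be sent as sketched below. The event type comes from the Event types list further down; the field names inside `session` are assumptions modeled on typical Realtime API schemas, not documented values.

```python
import json

async def update_session(ws) -> None:
    # ws is an open WebSocket connection to the Realtime API.
    # The nested field names ("instructions", "voice", "modalities")
    # are assumptions; only the "session.update" type is documented here.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": "You are a polite support agent.",  # system prompt
            "voice": "marina",                # hypothetical SpeechKit voice ID
            "modalities": ["audio", "text"],  # expected response modalities
        },
    }))
```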
Events
Events are the main way to exchange data in the Realtime API. Each client-server message takes the form of an event with a required type field that indicates the event's purpose.
The client sends events to transmit data, instructions, or commands, e.g., to generate a new answer, upload audio, or edit the session settings. The server responds with events containing partial or final results along with state updates.
The event exchange is two-way and asynchronous: the client can send new events to the server without waiting for the result of the previous task. This approach enables real-time handling of responses and reacting to them immediately, without waiting for full processing to complete.
Each event is transmitted as a separate JSON object.
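A sketch of this asynchronous exchange in Python: a sender task pushes client events from a queue while a receiver task reacts to server events as they arrive, so neither direction blocks the other.

```python
import asyncio
import json

async def sender(ws, outgoing: asyncio.Queue) -> None:
    # Send client events as soon as they are ready; there is no need to
    # wait for the server to finish processing earlier events.
    while True:
        event = await outgoing.get()
        await ws.send(json.dumps(event))

async def receiver(ws) -> None:
    # Handle server events (partial results and state updates) on arrival.
    async for message in ws:
        event = json.loads(message)
        print("server event:", event["type"])

async def talk(ws, outgoing: asyncio.Queue) -> None:
    # Both directions run concurrently over the same connection.
    await asyncio.gather(sender(ws, outgoing), receiver(ws))
```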
Event types
Realtime API supports the following event types:
- `session.update`: Session settings update, e.g., a change of the synthesis voice or the model's system prompt.
- `input_audio_buffer.append`: Transmission of an audio segment (PCM format, mono, 24 kHz, 16-bit, Base64-encoded).
- `input_audio_buffer.commit`: Completion of audio transmission.
- `response.create`: Start of a new model response generation.
- `response.output_text.delta`: Segment of the response as text (streaming).
- `response.output_audio.delta`: Segment of the response as audio (PCM, Base64).
- `response.output_audio.done`: Last segment of the response as audio.
- `response.done`: Completion of the model's response generation.
- `error`: Error message.
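Putting these together, one possible request cycle looks like the sketch below: stream the audio segments, commit the buffer, request a response, then collect the streamed deltas until `response.done` arrives. The event types come from the list above; the payload field names (`audio`, `delta`) are assumptions for illustration.

```python
import base64
import json

async def request_reply(ws, pcm_chunks) -> bytes:
    # Stream the user's audio (PCM, mono, 24 kHz, 16-bit) segment by segment.
    for chunk in pcm_chunks:
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),  # field name is an assumption
        }))
    # Mark the audio input as complete, then ask the model to respond.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    await ws.send(json.dumps({"type": "response.create"}))

    reply_audio = b""
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "response.output_audio.delta":
            reply_audio += base64.b64decode(event["audio"])    # field name is an assumption
        elif event["type"] == "response.output_text.delta":
            print(event.get("delta", ""), end="", flush=True)  # field name is an assumption
        elif event["type"] == "response.done":
            break
    return reply_audio
```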