© 2025 Direct Cursus Technology L.L.C.

Using LLMs to process recognition results

Written by
Yandex Cloud
Updated on December 3, 2025

With the SpeechKit API v3, you can recognize audio and process the recognition results with large language models in the same request. This lets you automate post-processing of recognition results: for example, instantly summarize the recognized audio, use it to add data to your CRM, or translate the conversation into English.

To use generative models in audio recognition, fill in the summarization section in the request configuration. You can specify just a text prompt for the model, or additionally define the response structure as a JSON schema. For more information on response structures, see the AI Studio documentation. To work with generative models in the SpeechKit API v3, you need the ai.languageModels.user role or higher.

Simple response:

{
...
  "summarization": {
    "modelUri": "gpt://<folder_ID>/<model_name>",
    "properties": [
      {
        "instruction": "Model prompt"
      }
    ]
  }
  ...
}

Arbitrary JSON:

{
...
  "summarization": {
    "modelUri": "gpt://<folder_ID>/<model_name>",
    "properties": [
      {
        "instruction": "Model prompt requiring a structured output",
        "jsonObject": true
      }
    ]
  }
  ...
}

Strict JSON schema:

{
...
  "summarization": {
    "modelUri": "gpt://<folder_ID>/<model_name>",
    "properties": [
      {
        "instruction": "Model prompt requiring a structured output",
        "jsonSchema": {
          // Specified output schema
          "schema": "<json_schema>"
        }
      }
    ]
  }
  ...
}

Where:

  • modelUri: URI of a model available in synchronous mode, in gpt://<folder_ID>/<model_name> format.
  • instruction: Prompt for the model, e.g., Highlight the main points or Translate into English.
  • jsonObject: If true, the model returns JSON output in arbitrary formatting.
  • jsonSchema: Schema the JSON output must strictly follow.
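As a minimal sketch, the summarization section above can be assembled programmatically before sending a recognition request. The build_request helper below is hypothetical, and the uri and recognitionModel fields are assumptions about the surrounding request body (not shown in this article); only the summarization section follows the samples above.

```python
import json

def build_request(audio_uri: str, folder_id: str, instruction: str,
                  json_object: bool = False) -> dict:
    """Build a recognition request body with a summarization section.

    Hypothetical helper: field names outside "summarization" are assumptions.
    """
    prop = {"instruction": instruction}
    if json_object:
        # Ask the model to return arbitrarily formatted JSON output.
        prop["jsonObject"] = True
    return {
        "uri": audio_uri,                       # assumption: audio source URI
        "recognitionModel": {"model": "general"},  # assumption
        "summarization": {
            "modelUri": f"gpt://{folder_id}/yandexgpt",
            "properties": [prop],
        },
    }

body = build_request("https://storage.yandexcloud.net/<bucket>/audio.ogg",
                     "<folder_ID>", "Highlight the main points")
print(json.dumps(body, indent=2))
```

Several properties entries can be listed to run more than one prompt over the same recognition result, as the two responses in the example output below suggest.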

The model will return its results in the summarization object:


{
...
  "result": {
    "summarization": {
      "results": [
        {
          "response": "LLM-generated text"
        },
        {
          "response": "LLM-generated JSON"
        }
      ],
      "contentUsage": {
        "inputTextTokens": 150,    // Input tokens
        "completionTokens": 80,    // Output tokens
        "totalTokens": 230         // Total consumption
      }
    }
  }
...
}

The contentUsage field contains token usage details. The cost of use is calculated according to the AI Studio pricing policy and depends on the selected model and the total number of input and output tokens, indicated in totalTokens.
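A minimal sketch of reading such a response, assuming a result shaped exactly like the example above (the result dict here is a stand-in for a parsed API response):

```python
# Stand-in for a parsed API response, copied from the example above.
result = {
    "result": {
        "summarization": {
            "results": [
                {"response": "LLM-generated text"},
                {"response": "LLM-generated JSON"},
            ],
            "contentUsage": {
                "inputTextTokens": 150,
                "completionTokens": 80,
                "totalTokens": 230,
            },
        }
    }
}

summ = result["result"]["summarization"]

# One response per prompt in the request's "properties" list.
responses = [r["response"] for r in summ["results"]]
print(responses)  # → ['LLM-generated text', 'LLM-generated JSON']

# totalTokens = input + output tokens; this drives billing.
usage = summ["contentUsage"]
print(f"tokens billed: {usage['totalTokens']}")
```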
