© 2025 Direct Cursus Technology L.L.C.

How to recognize long audio files in SpeechKit

Written by
Yandex Cloud
Updated at April 11, 2025

SpeechKit can recognize speech in several ways. This example demonstrates asynchronous recognition of an audio file, which is available in API v3 and API v2 and is subject to the following limits:

  • Maximum audio duration: 4 hours
  • Maximum file size: 1 GB
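A quick local check can catch a file that exceeds these limits before it is uploaded. The sketch below is an illustration, not part of SpeechKit: the function name `check_async_limits` is hypothetical, it reads the duration with Python's standard `wave` module and so applies only to WAV files, and it assumes that 1 GB means 1024³ bytes, which this article does not specify.

```python
import os
import wave

MAX_DURATION_S = 4 * 60 * 60   # 4 hours, from the limits above
MAX_SIZE_BYTES = 1024 ** 3     # 1 GB; assumption: binary gigabytes


def check_async_limits(path):
    """Return a list of limit violations for a local WAV file (empty if none)."""
    problems = []

    size = os.path.getsize(path)
    if size > MAX_SIZE_BYTES:
        problems.append(f"file size {size} bytes exceeds 1 GB")

    # Duration = number of frames / frames per second.
    with wave.open(path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
    if duration_s > MAX_DURATION_S:
        problems.append(f"duration {duration_s:.0f} s exceeds 4 hours")

    return problems
```

An empty list means the file passes both checks under the stated assumptions.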

In the example, the API is used via the cURL utility. If you want to use the API via a Python script, see the relevant tutorials.

Getting started

  1. Create a bucket and upload to it the audio file you want to recognize.

  2. Create a service account.

    Warning

    You can recognize audio files asynchronously only as a service account. Do not use any other Yandex Cloud account for this purpose.

  3. Assign the storage.uploader and ai.speechkit-stt.user roles to the service account for the folder where you created the bucket.

  4. Get an API key or IAM token for your service account.

  5. Download a sample audio file:

    • For API v3: a WAV file.
    • For API v2: an LPCM file.
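Whichever credential you obtain in step 4, the requests in this article pass it in the `Authorization` header: with the `Bearer` prefix for an IAM token and the `Api-Key` prefix for an API key. For readers who script the calls in Python, a minimal sketch of building that header follows; the helper name `auth_header` is hypothetical.

```python
def auth_header(iam_token=None, api_key=None):
    """Build the Authorization header from exactly one credential."""
    # Reject both "neither given" and "both given".
    if (iam_token is None) == (api_key is None):
        raise ValueError("pass exactly one of iam_token or api_key")
    if iam_token is not None:
        return {"Authorization": f"Bearer {iam_token}"}
    return {"Authorization": f"Api-Key {api_key}"}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.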

Speech recognition

Instructions for API v3 come first, followed by instructions for API v2.

API v3
  1. Get a link to an audio file in Object Storage.

  2. Create a file, e.g., request.json, and add the following code to it:

    {
      "uri": "https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>",
      "recognition_model": {
        "model": "general",
        "audio_format": {
          "container_audio": {
            "container_audio_type": "WAV"
          }
        }
      }
    }
    

    Where:

    • uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.wav.

      For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to pass these parameters to SpeechKit: they are ignored.

    • model: Speech recognition model.

    • container_audio_type: Audio container format.
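If you generate the request body from a script rather than by hand, the same structure can be built as a dictionary. The sketch below is an illustration only: the function name `build_v3_request` is hypothetical, and dropping the query string is optional, since the service ignores it anyway.

```python
from urllib.parse import urlsplit


def build_v3_request(link, model="general", container_audio_type="WAV"):
    """Return the recognizeFileAsync request body as a dict.

    Any query string in the Object Storage link is dropped,
    because SpeechKit ignores it.
    """
    uri = urlsplit(link)._replace(query="", fragment="").geturl()
    return {
        "uri": uri,
        "recognition_model": {
            "model": model,
            "audio_format": {
                "container_audio": {
                    "container_audio_type": container_audio_type
                }
            },
        },
    }
```

Serializing the result with `json.dump` into request.json gives a file equivalent to the one shown above.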

  3. Run the request using one of the service account authentication methods:

    • With an IAM token:

      export IAM_TOKEN=<service_account_IAM_token> && \
      curl \
        --insecure \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --data @request.json https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
      

      Where IAM_TOKEN is the IAM token of the service account.

    • With an API key.

      Use API keys if requesting an IAM token automatically is not an option.

      export API_KEY=<service_account_API_key> && \
      curl \
        --insecure \
        --header "Authorization: Api-Key ${API_KEY}" \
        --data @request.json https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
      

    Result example:

    {
       "id":"f8ddr61b30fk********",
       "description":"STT v3 async recognition",
       "createdAt":"2024-07-15T07:39:36Z",
       "createdBy":"ajehumcuv38h********",
       "modifiedAt":"2024-07-15T07:39:36Z",
       "done":false,
       "metadata":null
    }
    

    Save the recognition operation id you get in the response.

  4. Wait until the recognition is complete. It takes about 10 seconds to recognize one minute of audio.

  5. Request information about the operation:

    • Authentication with an IAM token:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        https://operation.stt.api.cloud.yandex.net/operations/<recognition_operation_ID>
      
    • Authentication with an API key:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        https://operation.stt.api.cloud.yandex.net/operations/<recognition_operation_ID>
      

    Result example:

    {
       "done": true,
       "id": "f8ddr61b30fk********",
       "description": "STT v3 async recognition",
       "createdAt": "2024-07-15T07:39:36Z",
       "createdBy": "ajehumcuv38h********",
       "modifiedAt": "2024-07-15T07:39:37Z"
    }
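Steps 4 and 5 amount to a polling loop: wait, request the operation, and repeat until `done` is `true`. The sketch below shows one way to structure it. All names are hypothetical, and `fetch` is a function you supply that performs the GET request from step 5 and returns the parsed JSON as a dictionary, so the sketch makes no assumptions about your HTTP client. The default interval and timeout are arbitrary values.

```python
import time


def estimated_wait_s(audio_duration_s):
    """About 10 s of processing per 60 s of audio, per the estimate in step 4."""
    return audio_duration_s / 60 * 10


def wait_for_operation(fetch, operation_id, interval_s=5.0, timeout_s=600.0,
                       sleep=time.sleep):
    """Call fetch(operation_id) repeatedly until the result has done == True.

    Raises TimeoutError if the operation is still not done after timeout_s
    seconds of waiting.
    """
    waited = 0.0
    while True:
        operation = fetch(operation_id)
        if operation.get("done"):
            return operation
        if waited >= timeout_s:
            raise TimeoutError(
                f"operation {operation_id} not done after {waited:.0f} s"
            )
        sleep(interval_s)
        waited += interval_s
```

For a long recording, sleeping for `estimated_wait_s(duration)` before the first poll avoids requests that are unlikely to succeed.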
    
  6. Request the operation result:

    • Authentication with an IAM token:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"
      
    • Authentication with an API key:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"
      
    Result example:

    {
       "result": {
          "sessionUuid": {
             "uuid": "24935f24-2c1f62dc-8dd49006-********",
             "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
             "receivedDataMs": "7400",
             "resetTimeMs": "0",
             "partialTimeMs": "7400",
             "finalTimeMs": "7400",
             "finalIndex": "0",
             "eouTimeMs": "0"
          },
          "responseWallTimeMs": "189",
          "final": {
             "alternatives": [
                {
                   "words": [
                      {
                         "text": "I",
                         "startTimeMs": "459",
                         "endTimeMs": "520"
                      },
                      {
                         "text": "Yandex",
                         "startTimeMs": "640",
                         "endTimeMs": "1060"
                      },
                      {
                         "text": "SpeechKit",
                         "startTimeMs": "1120",
                         "endTimeMs": "1959"
                      },
                      {
                         "text": "I",
                         "startTimeMs": "2480",
                         "endTimeMs": "2520"
                      },
                      {
                         "text": "can",
                         "startTimeMs": "2580",
                         "endTimeMs": "2800"
                      },
                      {
                         "text": "turn",
                         "startTimeMs": "2860",
                         "endTimeMs": "3360"
                      },
                      {
                         "text": "any",
                         "startTimeMs": "3439",
                         "endTimeMs": "3709"
                      },
                      {
                         "text": "text",
                         "startTimeMs": "3800",
                         "endTimeMs": "4140"
                      },
                      {
                         "text": "into",
                         "startTimeMs": "4200",
                         "endTimeMs": "4220"
                      },
                      {
                         "text": "speech",
                         "startTimeMs": "4279",
                         "endTimeMs": "4740"
                      },
                      {
                         "text": "now",
                         "startTimeMs": "5140",
                         "endTimeMs": "5759"
                      },
                      {
                         "text": "you",
                         "startTimeMs": "5859",
                         "endTimeMs": "5900"
                      },
                      {
                         "text": "can",
                         "startTimeMs": "5980",
                         "endTimeMs": "6399"
                      },
                      {
                         "text": "too",
                         "startTimeMs": "6660",
                         "endTimeMs": "7180"
                      }
                   ],
                   "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
                   "startTimeMs": "0",
                   "endTimeMs": "7400",
                   "confidence": 0,
                   "languages": []
                }
             ],
             "channelTag": "0"
          },
          "channelTag": "0"
       }
    }
    {
       "result": {
          "sessionUuid": {
             "uuid": "24935f24-2c1f62dc-8dd49006-********",
             "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
             "receivedDataMs": "7400",
             "resetTimeMs": "0",
             "partialTimeMs": "7400",
             "finalTimeMs": "7400",
             "finalIndex": "0",
             "eouTimeMs": "0"
          },
          "responseWallTimeMs": "189",
          "finalRefinement": {
             "finalIndex": "0",
             "normalizedText": {
                "alternatives": [
                   {
                      "words": [
                         {
                            "text": "I",
                            "startTimeMs": "459",
                            "endTimeMs": "520"
                         },
                         {
                            "text": "Yandex",
                            "startTimeMs": "640",
                            "endTimeMs": "1060"
                         },
                         {
                            "text": "SpeechKit",
                            "startTimeMs": "1120",
                            "endTimeMs": "1959"
                         },
                         {
                            "text": "I",
                            "startTimeMs": "2480",
                            "endTimeMs": "2520"
                         },
                         {
                            "text": "can",
                            "startTimeMs": "2580",
                            "endTimeMs": "2800"
                         },
                         {
                            "text": "turn",
                            "startTimeMs": "2860",
                            "endTimeMs": "3360"
                         },
                         {
                            "text": "any",
                            "startTimeMs": "3439",
                            "endTimeMs": "3709"
                         },
                         {
                            "text": "text",
                            "startTimeMs": "3800",
                            "endTimeMs": "4140"
                         },
                         {
                            "text": "into",
                            "startTimeMs": "4200",
                            "endTimeMs": "4220"
                         },
                         {
                            "text": "speech",
                            "startTimeMs": "4279",
                            "endTimeMs": "4740"
                         },
                         {
                            "text": "now",
                            "startTimeMs": "5140",
                            "endTimeMs": "5759"
                         },
                         {
                            "text": "you",
                            "startTimeMs": "5859",
                            "endTimeMs": "5900"
                         },
                         {
                            "text": "can",
                            "startTimeMs": "5980",
                            "endTimeMs": "6399"
                         },
                         {
                            "text": "too",
                            "startTimeMs": "6660",
                            "endTimeMs": "7180"
                         }
                      ],
                      "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
                      "startTimeMs": "0",
                      "endTimeMs": "7400",
                      "confidence": 0,
                      "languages": []
                   }
                ],
                "channelTag": "0"
             }
          },
          "channelTag": "0"
       }
    }
    {
       "result": {
          "sessionUuid": {
             "uuid": "24935f24-2c1f62dc-8dd49006-********",
             "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
             "receivedDataMs": "7400",
             "resetTimeMs": "0",
             "partialTimeMs": "7400",
             "finalTimeMs": "7400",
             "finalIndex": "0",
             "eouTimeMs": "7400"
          },
          "responseWallTimeMs": "190",
          "eouUpdate": {
             "timeMs": "7400"
          },
          "channelTag": "0"
       }
    }
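The result example above is a sequence of separate JSON objects rather than a single document, so a plain `json.loads` call on the whole response body would fail. The sketch below shows one way to split such a body and pull out the recognized text, assuming the response has the shape shown in the example. The function names are hypothetical; `json.JSONDecoder.raw_decode` is part of the Python standard library.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def final_texts(text, refined=False):
    """Collect the first alternative's text from each final message.

    With refined=True, read finalRefinement.normalizedText instead of final.
    """
    texts = []
    for message in iter_json_objects(text):
        result = message.get("result", {})
        if refined:
            block = result.get("finalRefinement", {}).get("normalizedText", {})
        else:
            block = result.get("final", {})
        alternatives = block.get("alternatives", [])
        if alternatives:
            texts.append(alternatives[0]["text"])
    return texts
```

Messages that carry neither block, such as the `eouUpdate` message, are skipped.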
    
API v2

  1. Get a link to an audio file in Object Storage.

  2. Create a file named body.json and add the following code to it:

    {
       "config": {
          "specification": {
             "languageCode": "ru-RU",
             "model": "general",
             "audioEncoding": "LINEAR16_PCM",
             "sampleRateHertz": 8000,
             "audioChannelCount": 1
          }
       },
       "audio": {
          "uri": "<link_to_audio_file>"
       }
    }
    

    Where:

    • languageCode: Recognition language.

    • model: Speech recognition model.

    • audioEncoding: Format of the submitted audio file.

    • sampleRateHertz: Audio file sampling rate in Hz.

    • audioChannelCount: Number of audio channels.

    • uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.pcm.

      For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to pass these parameters to SpeechKit: they are ignored.
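Because the request states the sampling rate and channel count explicitly, you can estimate the duration of a raw LPCM file from its size alone, for example to compare it against the 4-hour limit. The sketch below assumes a headerless file with 16-bit samples (2 bytes each), which is what LINEAR16_PCM denotes; the function name `lpcm_duration_s` is hypothetical.

```python
BYTES_PER_SAMPLE = 2  # assumption: 16-bit linear PCM, no file header


def lpcm_duration_s(num_bytes, sample_rate_hertz=8000, audio_channel_count=1):
    """Return the duration in seconds of raw 16-bit PCM data of the given size."""
    bytes_per_second = sample_rate_hertz * audio_channel_count * BYTES_PER_SAMPLE
    return num_bytes / bytes_per_second
```

With the values from the request above (8000 Hz, one channel), each second of audio takes 16,000 bytes.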

  3. Send the recognition request using the created file:

    export API_KEY=<service_account_API_key> && \
    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      --data "@body.json" \
      https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
    

    Result example:

    {
       "done": false,
       "id": "e03sup6d5h1q********",
       "createdAt": "2019-04-21T22:49:29Z",
       "createdBy": "ajes08feato8********",
       "modifiedAt": "2019-04-21T22:49:29Z"
    }
    

    Save the recognition operation id you get in the response.

  4. Wait until the recognition is completed. It takes about 10 seconds to recognize one minute of single-channel audio.

  5. Send a request to get information about the operation:

    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>
    

    Result example:

    {
       "done": true,
       "response": {
          "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
          "chunks": [
             {
                "alternatives": [
                   {
                      "words": [
                         {
                            "startTime": "0.160s",
                            "endTime": "0.500s",
                            "word": "hello",
                            "confidence": 1
                         },
                         {
                            "startTime": "0.580s",
                            "endTime": "0.800s",
                            "word": "world",
                            "confidence": 1
                         }
                      ],
                      "text": "Hello world",
                      "confidence": 1
                   }
                ],
                "channelTag": "1"
             }
          ]
       },
       "id": "e03jjenu23uc********",
       "createdAt": "2024-08-22T11:39:22Z",
       "createdBy": "aje3bg430agh********",
       "modifiedAt": "2024-08-22T11:39:23Z"
    }
    

    If speech recognition in the provided file fails, the response.chunks section may be missing from the response.
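Since `response.chunks` may be absent, code that reads the result should not index into it unconditionally. A minimal sketch of a tolerant reader follows, assuming the response shape shown in the example above; the function name `v2_transcript` is hypothetical.

```python
def v2_transcript(operation):
    """Join the first alternative of every chunk into one string.

    Returns an empty string when the response or its chunks are missing.
    """
    chunks = operation.get("response", {}).get("chunks", [])
    parts = []
    for chunk in chunks:
        alternatives = chunk.get("alternatives", [])
        if alternatives:
            parts.append(alternatives[0].get("text", ""))
    return " ".join(parts)
```

For multichannel audio, this simple join mixes chunks from all channels; grouping by `channelTag` first would keep them apart.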

See also

  • Speech recognition
  • API v3 for asynchronous recognition
  • Asynchronous recognition API v2
  • Authentication with the SpeechKit API
  • Asynchronous WAV audio file recognition using the API v3
  • Asynchronous recognition of OggOpus audio files using the API v2
