Yandex project
© 2025 Yandex.Cloud LLC
Yandex SpeechKit

Asynchronous WAV audio file recognition using the API v3

Written by
Yandex Cloud
Updated at April 28, 2025

The example below illustrates how to use the SpeechKit API v3 for asynchronous speech recognition from a WAV audio file. This example uses the following parameters:

  • Audio stream format: WAV.
  • Speech recognition model: general.
  • Other parameters are left at their defaults.

Authentication is performed under a service account using an API key or IAM token. For more information about authentication in the SpeechKit API, see the API reference.
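As a minimal sketch of the two authentication options, the helper below builds the Authorization header value in the `Bearer` and `Api-Key` forms used by the requests in this article. The function name `authorization_header` is hypothetical and is not part of any SpeechKit SDK.

```python
from typing import Dict, Optional


def authorization_header(iam_token: Optional[str] = None,
                         api_key: Optional[str] = None) -> Dict[str, str]:
    """Build the Authorization header for exactly one credential type.

    Hypothetical helper: the header formats mirror the cURL examples
    in this article ("Bearer <IAM_token>" and "Api-Key <API_key>").
    """
    # Require exactly one credential so the caller cannot mix both by accident.
    if (iam_token is None) == (api_key is None):
        raise ValueError("Pass exactly one of iam_token or api_key")
    if iam_token is not None:
        return {"Authorization": f"Bearer {iam_token}"}
    return {"Authorization": f"Api-Key {api_key}"}
```

The returned dictionary can be passed as HTTP headers; for gRPC metadata, the same value goes under the lowercase `authorization` key.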

Getting started

  1. Create a bucket and upload the audio file you want to recognize to it.

  2. Create a service account.

    Warning

    You can only recognize audio files asynchronously as a service account. Do not use any other Yandex Cloud accounts for this purpose.

  3. Assign the storage.uploader and ai.speechkit-stt.user roles to the service account for the folder where you created the bucket.

  4. Get an IAM token or API key for the created service account.

If you do not have a WAV audio file, you can use this sample file.
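Before uploading a file, it can help to confirm that it is a readable WAV file and to see how long it is. The sketch below uses only Python's standard `wave` module. The function name `describe_wav` is hypothetical, and the check does not verify any SpeechKit-specific format requirements.

```python
import wave
from typing import Dict, Union


def describe_wav(path: str) -> Dict[str, Union[int, float]]:
    """Read basic parameters from a WAV header using the standard library.

    wave.open raises wave.Error if the file is not a WAV file
    that the module can parse.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration is the frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }
```

The `duration_s` value gives a rough idea of how long recognition may take, given the estimate of about 10 seconds per minute of audio stated later in this article.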

Perform speech recognition via the API v3

Two variants follow: cURL first, then Python 3.

cURL
  1. Get a link to an audio file in Object Storage.

  2. Create a file, e.g., request.json, and add the following code to it:

    {
      "uri": "https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>",
      "recognition_model": {
        "model": "general",
        "audio_format": {
          "container_audio": {
            "container_audio_type": "WAV"
          }
        }
      }
    }
    

    Where:

    • uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.wav.

      For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to pass these parameters to SpeechKit: they are ignored.

    • model: Speech recognition model.

    • container_audio_type: Audio container format.

  3. Run the request using one of the service account authentication methods:

    • With an IAM token:

      export IAM_TOKEN=<service_account_IAM_token> && \
      curl \
        --insecure \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --data @request.json https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
      

      Where IAM_TOKEN is the IAM token of the service account.

    • With an API key:

      Use API keys if requesting an IAM token automatically is not an option.

      export API_KEY=<service_account_API_key> && \
      curl \
        --insecure \
        --header "Authorization: Api-Key ${API_KEY}" \
        --data @request.json https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
      

    Result example:

    {
       "id":"f8ddr61b30fk********",
       "description":"STT v3 async recognition",
       "createdAt":"2024-07-15T07:39:36Z",
       "createdBy":"ajehumcuv38h********",
       "modifiedAt":"2024-07-15T07:39:36Z",
       "done":false,
       "metadata":null
    }
    

    Save the recognition operation id you get in the response.

  4. Wait until the recognition is complete. It takes about 10 seconds to recognize one minute of audio.

  5. Request information about the operation:

    • Authentication with an IAM token:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        https://operation.stt.api.cloud.yandex.net/operations/<recognition_operation_ID>
      
    • Authentication with an API key:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        https://operation.stt.api.cloud.yandex.net/operations/<recognition_operation_ID>
      

    Result example:

    {
       "done": true,
       "id": "f8ddr61b30fk********",
       "description": "STT v3 async recognition",
       "createdAt": "2024-07-15T07:39:36Z",
       "createdBy": "ajehumcuv38h********",
       "modifiedAt": "2024-07-15T07:39:37Z"
    }
    
  6. Request the operation result:

    • Authentication with an IAM token:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"
      
    • Authentication with an API key:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"
      
    Result example:

    {
       "result": {
          "sessionUuid": {
             "uuid": "24935f24-2c1f62dc-8dd49006-********",
             "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
             "receivedDataMs": "7400",
             "resetTimeMs": "0",
             "partialTimeMs": "7400",
             "finalTimeMs": "7400",
             "finalIndex": "0",
             "eouTimeMs": "0"
          },
          "responseWallTimeMs": "189",
          "final": {
             "alternatives": [
                {
                   "words": [
                      {
                         "text": "I",
                         "startTimeMs": "459",
                         "endTimeMs": "520"
                      },
                      {
                         "text": "Yandex",
                         "startTimeMs": "640",
                         "endTimeMs": "1060"
                      },
                      {
                         "text": "SpeechKit",
                         "startTimeMs": "1120",
                         "endTimeMs": "1959"
                      },
                      {
                         "text": "I",
                         "startTimeMs": "2480",
                         "endTimeMs": "2520"
                      },
                      {
                         "text": "can",
                         "startTimeMs": "2580",
                         "endTimeMs": "2800"
                      },
                      {
                         "text": "turn",
                         "startTimeMs": "2860",
                         "endTimeMs": "3360"
                      },
                      {
                         "text": "any",
                         "startTimeMs": "3439",
                         "endTimeMs": "3709"
                      },
                      {
                         "text": "text",
                         "startTimeMs": "3800",
                         "endTimeMs": "4140"
                      },
                      {
                         "text": "into",
                         "startTimeMs": "4200",
                         "endTimeMs": "4220"
                      },
                      {
                         "text": "speech",
                         "startTimeMs": "4279",
                         "endTimeMs": "4740"
                      },
                      {
                         "text": "now",
                         "startTimeMs": "5140",
                         "endTimeMs": "5759"
                      },
                      {
                         "text": "you",
                         "startTimeMs": "5859",
                         "endTimeMs": "5900"
                      },
                      {
                         "text": "can",
                         "startTimeMs": "5980",
                         "endTimeMs": "6399"
                      },
                      {
                         "text": "too",
                         "startTimeMs": "6660",
                         "endTimeMs": "7180"
                      }
                   ],
                   "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
                   "startTimeMs": "0",
                   "endTimeMs": "7400",
                   "confidence": 0,
                   "languages": []
                }
             ],
             "channelTag": "0"
          },
          "channelTag": "0"
       }
    }
    {
       "result": {
          "sessionUuid": {
             "uuid": "24935f24-2c1f62dc-8dd49006-********",
             "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
             "receivedDataMs": "7400",
             "resetTimeMs": "0",
             "partialTimeMs": "7400",
             "finalTimeMs": "7400",
             "finalIndex": "0",
             "eouTimeMs": "0"
          },
          "responseWallTimeMs": "189",
          "finalRefinement": {
             "finalIndex": "0",
             "normalizedText": {
                "alternatives": [
                   {
                      "words": [
                         {
                            "text": "I",
                            "startTimeMs": "459",
                            "endTimeMs": "520"
                         },
                         {
                            "text": "Yandex",
                            "startTimeMs": "640",
                            "endTimeMs": "1060"
                         },
                         {
                            "text": "SpeechKit",
                            "startTimeMs": "1120",
                            "endTimeMs": "1959"
                         },
                         {
                            "text": "I",
                            "startTimeMs": "2480",
                            "endTimeMs": "2520"
                         },
                         {
                            "text": "can",
                            "startTimeMs": "2580",
                            "endTimeMs": "2800"
                         },
                         {
                            "text": "turn",
                            "startTimeMs": "2860",
                            "endTimeMs": "3360"
                         },
                         {
                            "text": "any",
                            "startTimeMs": "3439",
                            "endTimeMs": "3709"
                         },
                         {
                            "text": "text",
                            "startTimeMs": "3800",
                            "endTimeMs": "4140"
                         },
                         {
                            "text": "into",
                            "startTimeMs": "4200",
                            "endTimeMs": "4220"
                         },
                         {
                            "text": "speech",
                            "startTimeMs": "4279",
                            "endTimeMs": "4740"
                         },
                         {
                            "text": "now",
                            "startTimeMs": "5140",
                            "endTimeMs": "5759"
                         },
                         {
                            "text": "you",
                            "startTimeMs": "5859",
                            "endTimeMs": "5900"
                         },
                         {
                            "text": "can",
                            "startTimeMs": "5980",
                            "endTimeMs": "6399"
                         },
                         {
                            "text": "too",
                            "startTimeMs": "6660",
                            "endTimeMs": "7180"
                         }
                      ],
                      "text": "I'm Yandex SpeechKit I can turn any text into speech now you can too",
                      "startTimeMs": "0",
                      "endTimeMs": "7400",
                      "confidence": 0,
                      "languages": []
                   }
                ],
                "channelTag": "0"
             }
          },
          "channelTag": "0"
       }
    }
    {
       "result": {
          "sessionUuid": {
             "uuid": "24935f24-2c1f62dc-8dd49006-********",
             "userRequestId": "f8d2h7m07t4i********"
          },
          "audioCursors": {
             "receivedDataMs": "7400",
             "resetTimeMs": "0",
             "partialTimeMs": "7400",
             "finalTimeMs": "7400",
             "finalIndex": "0",
             "eouTimeMs": "7400"
          },
          "responseWallTimeMs": "190",
          "eouUpdate": {
             "timeMs": "7400"
          },
          "channelTag": "0"
       }
    }
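
The result example above shows several JSON objects one after another rather than a single JSON array, so a plain `json.loads` call on the whole body would fail. Assuming the response body arrives in that same shape, the sketch below splits it into separate objects with the standard library's `JSONDecoder.raw_decode` and collects the recognized text from the `final` events. The function names `iter_json_objects` and `final_texts` are hypothetical.

```python
import json
from typing import Iterator, List


def iter_json_objects(payload: str) -> Iterator[dict]:
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(payload)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < length and payload[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed object and the index where it ended.
        obj, pos = decoder.raw_decode(payload, pos)
        yield obj


def final_texts(payload: str) -> List[str]:
    """Collect the text of the first alternative from every `final` event."""
    texts = []
    for event in iter_json_objects(payload):
        final = event.get("result", {}).get("final")
        if final and final.get("alternatives"):
            texts.append(final["alternatives"][0]["text"])
    return texts
```

Events without a `final` field, such as the `finalRefinement` and `eouUpdate` events in the example, are skipped by `final_texts`.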
    
Python 3

  1. Clone the Yandex Cloud API repository:

    git clone https://github.com/yandex-cloud/cloudapi
    
  2. Use the pip package manager to install the grpcio-tools package:

    pip install grpcio-tools
    
  3. Go to the folder hosting the cloned Yandex Cloud API repository, create a folder named output, and generate the client interface code there:

    cd <path_to_cloudapi_folder>
    mkdir output
    python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
      --python_out=output \
      --grpc_python_out=output \
      google/api/http.proto \
      google/api/annotations.proto \
      yandex/cloud/api/operation.proto \
      google/rpc/status.proto \
      yandex/cloud/operation/operation.proto \
      yandex/cloud/validation.proto \
      yandex/cloud/ai/stt/v3/stt_service.proto \
      yandex/cloud/ai/stt/v3/stt.proto
    

    The stt_pb2.py, stt_pb2_grpc.py, stt_service_pb2.py, and stt_service_pb2_grpc.py client interface files, as well as dependency files, will be created in the output folder.

  4. Create a file, e.g., test.py, in the output folder root and add the following API request code to it:

    import grpc
    from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc
    
    request = stt_pb2.RecognizeFileRequest(
      uri='https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>',
      recognition_model=stt_pb2.RecognitionModelOptions(
        model='general',
        audio_format=stt_pb2.AudioFormatOptions(
          container_audio=stt_pb2.ContainerAudio(
            container_audio_type=stt_pb2.ContainerAudio.WAV
          )
        )
      )
    )
    
    cred = grpc.ssl_channel_credentials()
    chan = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)
    stub = stt_service_pb2_grpc.AsyncRecognizerStub(chan)
    
    # Choose one of the authentication methods:
    
    # Authentication with an IAM token
    response = stub.RecognizeFile(request, metadata=[('authorization', 'Bearer <IAM_token>')])
    
    # Authentication with an API key
    # response = stub.RecognizeFile(request, metadata=[('authorization', 'Api-Key <API_key>')])
    
    print(response)
    
  5. Run this request:

    python3 test.py
    

    Result:

    id: "f8dem628l2mq********"
    description: "STT v3 async recognition"
    created_at {
      seconds: 1721032219
    }
    created_by: "ajehumcuv38h********"
    modified_at {
      seconds: 1721032219
    }
    

    Save the recognition operation id you get in the response.

  6. Create a file, e.g., results.py, in the output folder root and add the following code to it to get the operation result:

    import grpc
    from yandex.cloud.ai.stt.v3 import stt_service_pb2, stt_service_pb2_grpc
    
    request = stt_service_pb2.GetRecognitionRequest(
        operation_id="<operation_ID>"
    )
    
    cred = grpc.ssl_channel_credentials()
    chan = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)
    stub = stt_service_pb2_grpc.AsyncRecognizerStub(chan)
    
    # Authentication with an IAM token
    response = stub.GetRecognition(request, metadata=[('authorization', 'Bearer <IAM_token>')])
    
    # Authentication with an API key
    # response = stub.GetRecognition(request, metadata=[('authorization', 'Api-Key <API_key>')])
    
    print(list(response))
    
  7. Run this request:

    python3 results.py
    
    Result example:

    [session_uuid {
      uuid: "df49eaa2-25a55218-ae967fa1-********"
      user_request_id: "f8dkup42nmhk********"
    }
    audio_cursors {
      received_data_ms: 6600
      partial_time_ms: 6600
      final_time_ms: 6600
    }
    response_wall_time_ms: 204
    final {
      alternatives {
        words {
          text: "I"
          start_time_ms: 380
          end_time_ms: 420
        }
        words {
          text: "Yandex"
          start_time_ms: 539
          end_time_ms: 919
        }
        words {
          text: "SpeechKit"
          start_time_ms: 960
          end_time_ms: 1719
        }
        words {
          text: "I"
          start_time_ms: 2159
          end_time_ms: 2200
        }
        words {
          text: "can"
          start_time_ms: 2260
          end_time_ms: 2440
        }
        words {
          text: "turn"
          start_time_ms: 2520
          end_time_ms: 3000
        }
        words {
          text: "any"
          start_time_ms: 3060
          end_time_ms: 3320
        }
        words {
          text: "text"
          start_time_ms: 3419
          end_time_ms: 3740
        }
        words {
          text: "into"
          start_time_ms: 3780
          end_time_ms: 3800
        }
        words {
          text: "speech"
          start_time_ms: 3860
          end_time_ms: 4279
        }
        words {
          text: "now"
          start_time_ms: 4680
          end_time_ms: 5240
        }
        words {
          text: "you"
          start_time_ms: 5339
          end_time_ms: 5380
        }
        words {
          text: "can"
          start_time_ms: 5460
          end_time_ms: 5766
        }
        words {
          text: "too"
          start_time_ms: 5920
          end_time_ms: 6393
        }
        text: "I'm Yandex SpeechKit I can turn any text into speech now you can too"
        end_time_ms: 6600
      }
      channel_tag: "0"
    }
    channel_tag: "0"
    , session_uuid {
      uuid: "df49eaa2-25a55218-ae967fa1-********"
      user_request_id: "f8dkup42nmhk********"
    }
    audio_cursors {
      received_data_ms: 6600
      partial_time_ms: 6600
      final_time_ms: 6600
    }
    response_wall_time_ms: 204
    final_refinement {
      normalized_text {
        alternatives {
          words {
            text: "I"
            start_time_ms: 380
            end_time_ms: 420
          }
          words {
            text: "Yandex"
            start_time_ms: 539
            end_time_ms: 919
          }
          words {
            text: "SpeechKit"
            start_time_ms: 960
            end_time_ms: 1719
          }
          words {
            text: "I"
            start_time_ms: 2159
            end_time_ms: 2200
          }
          words {
            text: "can"
            start_time_ms: 2260
            end_time_ms: 2440
          }
          words {
            text: "turn"
            start_time_ms: 2520
            end_time_ms: 3000
          }
          words {
            text: "any"
            start_time_ms: 3060
            end_time_ms: 3320
          }
          words {
            text: "text"
            start_time_ms: 3419
            end_time_ms: 3740
          }
          words {
            text: "into"
            start_time_ms: 3780
            end_time_ms: 3800
          }
          words {
            text: "speech"
            start_time_ms: 3860
            end_time_ms: 4279
          }
          words {
            text: "now"
            start_time_ms: 4680
            end_time_ms: 5240
          }
          words {
            text: "you"
            start_time_ms: 5339
            end_time_ms: 5380
          }
          words {
            text: "can"
            start_time_ms: 5460
            end_time_ms: 5766
          }
          words {
            text: "too"
            start_time_ms: 5920
            end_time_ms: 6393
          }
          text: "I'm Yandex SpeechKit I can turn any text into speech now you can too"
          end_time_ms: 6600
        }
        channel_tag: "0"
      }
    }
    channel_tag: "0"
    , session_uuid {
      uuid: "df49eaa2-25a55218-ae967fa1-********"
      user_request_id: "f8dkup42nmhk********"
    }
    audio_cursors {
      received_data_ms: 6600
      partial_time_ms: 6600
      final_time_ms: 6600
      eou_time_ms: 6600
    }
    response_wall_time_ms: 204
    eou_update {
      time_ms: 6600
    }
    channel_tag: "0"
    ]
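
The Python steps above go from RecognizeFile straight to GetRecognition without showing how to wait for the operation. One possible approach, sketched below with the standard library only, is to poll the operation endpoint used in the cURL steps until the `done` field is true. The names `fetch_operation` and `wait_until_done`, the polling interval, and the timeout are assumptions made for illustration, and the sketch does not inspect any error details the operation may carry.

```python
import json
import time
import urllib.request
from typing import Callable

# Operation status endpoint, as used in the cURL steps of this article.
OPERATION_URL = "https://operation.stt.api.cloud.yandex.net/operations/{operation_id}"


def fetch_operation(operation_id: str, auth_value: str) -> dict:
    """GET the operation status.

    auth_value is the full header value, e.g. 'Bearer <IAM_token>'
    or 'Api-Key <API_key>'.
    """
    request = urllib.request.Request(
        OPERATION_URL.format(operation_id=operation_id),
        headers={"Authorization": auth_value},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_until_done(operation_id: str,
                    auth_value: str,
                    fetch: Callable[[str, str], dict] = fetch_operation,
                    sleep: Callable[[float], None] = time.sleep,
                    interval_s: float = 5.0,
                    timeout_s: float = 600.0) -> dict:
    """Poll the operation until 'done' is true or the timeout is reached."""
    waited = 0.0
    while True:
        operation = fetch(operation_id, auth_value)
        if operation.get("done"):
            return operation
        if waited >= timeout_s:
            raise TimeoutError(f"Operation {operation_id} not done after {waited:.0f} s")
        sleep(interval_s)
        waited += interval_s
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access. In a script such as results.py, `wait_until_done("<operation_ID>", "Bearer <IAM_token>")` could run before the GetRecognition call.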
    
