How to recognize long audio files in SpeechKit
The service can recognize speech in different ways. The provided example demonstrates asynchronous recognition of an audio file. Asynchronous recognition is available via API v3 and API v2. Asynchronous recognition is subject to these restrictions:
- Maximum audio duration: 4 hours
- Maximum file size: 1 GB
The example calls the API with cURL.
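The size restriction above can be checked locally before uploading. The following is a minimal sketch, not part of the service tooling: the function name check_audio_size is hypothetical, and it assumes that 1 GB means 1024^3 bytes, which the restriction list does not specify. The duration limit is not checked here because that would require an audio tool.

```shell
# Assumption: "1 GB" is treated as 1024^3 bytes; the limit may be decimal instead.
MAX_BYTES=$((1024 * 1024 * 1024))

# check_audio_size FILE
# Prints the file size and returns 0 if FILE fits the limit,
# 1 if it is too large, 2 if FILE is not a regular file.
check_audio_size() {
  audio_file=$1
  if [ ! -f "$audio_file" ]; then
    echo "not a file: $audio_file" >&2
    return 2
  fi
  # wc -c reads the whole file; tr removes padding some platforms add.
  audio_size=$(wc -c < "$audio_file" | tr -d ' ')
  if [ "$audio_size" -le "$MAX_BYTES" ]; then
    echo "ok: $audio_size bytes"
  else
    echo "too large: $audio_size bytes (limit $MAX_BYTES)" >&2
    return 1
  fi
}
```

For example, `check_audio_size speech.wav` can be run before uploading the file to the bucket.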
Getting started
- Create a bucket and upload to it the audio file you want to recognize.
- Create a service account.
  Warning: You can recognize audio files asynchronously only as a service account. Do not use any other Yandex Cloud accounts for this purpose.
- Assign the storage.uploader and ai.speechkit-stt.user roles to the service account for the folder in which you created the bucket.
- Get an API key or IAM token for your service account.
- Download a sample audio file:
  - For API v3: a WAV file.
  - For API v2: an LPCM file.
Speech recognition
API v3

- Get a link to an audio file in Object Storage.
- Create a file, e.g., request.json, and add the following code to it:

    {
      "uri": "https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>",
      "recognition_model": {
        "model": "general",
        "audio_format": {
          "container_audio": {
            "container_audio_type": "WAV"
          }
        }
      }
    }

  Where:
  - uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.wav.
    For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to provide these parameters in SpeechKit as they are ignored.
  - model: Speech recognition model.
  - container_audio_type: Audio container format.
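The request body described above can also be generated from a bucket name and object path. This is a sketch under assumptions: the function names strip_query and make_request_json are hypothetical, and the bucket and object names are assumed to contain no characters that need JSON escaping. Stripping the query part is optional, since the parameters are ignored anyway.

```shell
# strip_query URL
# Prints URL without its query part (everything from the first "?").
strip_query() {
  printf '%s\n' "${1%%\?*}"
}

# make_request_json BUCKET OBJECT_PATH
# Prints a request body for a WAV file; values mirror the request.json example.
# Assumption: BUCKET and OBJECT_PATH need no JSON escaping.
make_request_json() {
  cat <<EOF
{
  "uri": "https://storage.yandexcloud.net/$1/$2",
  "recognition_model": {
    "model": "general",
    "audio_format": {
      "container_audio": {
        "container_audio_type": "WAV"
      }
    }
  }
}
EOF
}
```

For example, `make_request_json speechkit speech.wav > request.json` writes a body that points at https://storage.yandexcloud.net/speechkit/speech.wav.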
- Run the request using one of the service account authentication methods:
  - With an IAM token:

    export FOLDER_ID=<folder_ID>
    export IAM_TOKEN=<service_account_IAM_token> && \
    curl \
      --insecure \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      --data @request.json \
      https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync

    Where:
    - FOLDER_ID: ID of the folder your service account was created in.
    - IAM_TOKEN: Service account IAM token.
  - With an API key:
    Use API keys if requesting an IAM token automatically is not an option.

    export FOLDER_ID=<folder_ID>
    export API_KEY=<service_account_API_key> && \
    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      --data @request.json \
      https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync

  Result example:

    {
      "id": "f8ddr61b30fk********",
      "description": "STT v3 async recognition",
      "createdAt": "2024-07-15T07:39:36Z",
      "createdBy": "ajehumcuv38h********",
      "modifiedAt": "2024-07-15T07:39:36Z",
      "done": false,
      "metadata": null
    }

  Save the recognition operation id you get in the response.
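The two authentication methods differ only in the Authorization header, so a script can choose the header from whichever variable is set. The sketch below is illustrative: auth_header and operation_id are hypothetical helper names, and the sed-based id extraction assumes the response has the shape shown in the result example (a JSON parser would be more robust).

```shell
# auth_header
# Prints the Authorization header: IAM token if set, otherwise API key.
auth_header() {
  if [ -n "${IAM_TOKEN:-}" ]; then
    printf 'Authorization: Bearer %s\n' "$IAM_TOKEN"
  elif [ -n "${API_KEY:-}" ]; then
    printf 'Authorization: Api-Key %s\n' "$API_KEY"
  else
    echo "set IAM_TOKEN or API_KEY" >&2
    return 1
  fi
}

# operation_id
# Reads a response from stdin and prints the value of the first "id" field.
# Assumption: the response looks like the result example; this is not a JSON parser.
operation_id() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (performs a real network call, so it is left commented out):
#   curl --header "$(auth_header)" --header "x-folder-id: ${FOLDER_ID}" \
#     --data @request.json \
#     https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync | operation_id
```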
- Wait until the recognition is complete. It takes about 10 seconds to recognize one minute of audio.
- Request information about the operation:
  - Authentication with an IAM token:

    curl \
      --insecure \
      --request GET \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

  - Authentication with an API key:

    curl \
      --insecure \
      --request GET \
      --header "Authorization: Api-Key ${API_KEY}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

  Result example:

    {
      "done": true,
      "id": "f8ddr61b30fk********",
      "description": "STT v3 async recognition",
      "createdAt": "2024-07-15T07:39:36Z",
      "createdBy": "ajehumcuv38h********",
      "modifiedAt": "2024-07-15T07:39:37Z"
    }
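Waiting and checking the operation status can be combined into a polling loop. The following sketch makes several assumptions: the function names are hypothetical, AUTH_HEADER is a variable the caller sets to a complete Authorization header, the wait estimate just applies the "10 seconds per minute of audio" figure, and the done check is a simple pattern match rather than a JSON parser. The --insecure flag is left out so that TLS verification stays on.

```shell
# estimate_wait_seconds AUDIO_SECONDS
# Applies "about 10 seconds per minute of audio", rounded up.
estimate_wait_seconds() {
  echo $(( ($1 + 5) / 6 ))
}

# is_done: reads operation JSON from stdin, succeeds if "done" is true.
is_done() {
  grep -Eq '"done"[[:space:]]*:[[:space:]]*true'
}

# fetch_operation OPERATION_ID: prints the operation JSON (real network call).
# Assumption: AUTH_HEADER holds e.g. "Authorization: Api-Key <key>".
fetch_operation() {
  curl --silent --show-error \
    --header "${AUTH_HEADER}" \
    --header "x-folder-id: ${FOLDER_ID}" \
    "https://operation.api.cloud.yandex.net/operations/$1"
}

# wait_for_operation OPERATION_ID [INTERVAL_SECONDS] [MAX_TRIES]
# Returns 0 once the operation is done, 1 if it is still running after MAX_TRIES.
wait_for_operation() {
  op_id=$1
  poll_interval=${2:-10}
  max_tries=${3:-360}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if fetch_operation "$op_id" | is_done; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$poll_interval"
  done
  return 1
}
```

For a 10-minute recording, `estimate_wait_seconds 600` gives a first delay of 100 seconds, after which `wait_for_operation <recognition_operation_ID>` can poll at the default interval.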
- Request the operation result:
  - Authentication with an IAM token:

    curl \
      --insecure \
      --request GET \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"

  - Authentication with an API key:

    curl \
      --insecure \
      --request GET \
      --header "Authorization: Api-Key ${API_KEY}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"

  Result example:

    {
      "result": {
        "sessionUuid": {
          "uuid": "24935f24-2c1f62dc-8dd49006-********",
          "userRequestId": "f8d2h7m07t4i********"
        },
        "audioCursors": {
          "receivedDataMs": "7400",
          "resetTimeMs": "0",
          "partialTimeMs": "7400",
          "finalTimeMs": "7400",
          "finalIndex": "0",
          "eouTimeMs": "0"
        },
        "responseWallTimeMs": "189",
        "final": {
          "alternatives": [
            {
              "words": [
                { "text": "я", "startTimeMs": "459", "endTimeMs": "520" },
                { "text": "яндекс", "startTimeMs": "640", "endTimeMs": "1060" },
                { "text": "спичкит", "startTimeMs": "1120", "endTimeMs": "1959" },
                { "text": "я", "startTimeMs": "2480", "endTimeMs": "2520" },
                { "text": "могу", "startTimeMs": "2580", "endTimeMs": "2800" },
                { "text": "превратить", "startTimeMs": "2860", "endTimeMs": "3360" },
                { "text": "любой", "startTimeMs": "3439", "endTimeMs": "3709" },
                { "text": "текст", "startTimeMs": "3800", "endTimeMs": "4140" },
                { "text": "в", "startTimeMs": "4200", "endTimeMs": "4220" },
                { "text": "речь", "startTimeMs": "4279", "endTimeMs": "4740" },
                { "text": "теперь", "startTimeMs": "5140", "endTimeMs": "5759" },
                { "text": "и", "startTimeMs": "5859", "endTimeMs": "5900" },
                { "text": "вы", "startTimeMs": "5980", "endTimeMs": "6399" },
                { "text": "можете", "startTimeMs": "6660", "endTimeMs": "7180" }
              ],
              "text": "я яндекс спичкит я могу превратить любой текст в речь теперь и вы можете",
              "startTimeMs": "0",
              "endTimeMs": "7400",
              "confidence": 0,
              "languages": []
            }
          ],
          "channelTag": "0"
        },
        "channelTag": "0"
      }
    }
    {
      "result": {
        "sessionUuid": {
          "uuid": "24935f24-2c1f62dc-8dd49006-********",
          "userRequestId": "f8d2h7m07t4i********"
        },
        "audioCursors": {
          "receivedDataMs": "7400",
          "resetTimeMs": "0",
          "partialTimeMs": "7400",
          "finalTimeMs": "7400",
          "finalIndex": "0",
          "eouTimeMs": "0"
        },
        "responseWallTimeMs": "189",
        "finalRefinement": {
          "finalIndex": "0",
          "normalizedText": {
            "alternatives": [
              {
                "words": [
                  { "text": "я", "startTimeMs": "459", "endTimeMs": "520" },
                  { "text": "яндекс", "startTimeMs": "640", "endTimeMs": "1060" },
                  { "text": "спичкит", "startTimeMs": "1120", "endTimeMs": "1959" },
                  { "text": "я", "startTimeMs": "2480", "endTimeMs": "2520" },
                  { "text": "могу", "startTimeMs": "2580", "endTimeMs": "2800" },
                  { "text": "превратить", "startTimeMs": "2860", "endTimeMs": "3360" },
                  { "text": "любой", "startTimeMs": "3439", "endTimeMs": "3709" },
                  { "text": "текст", "startTimeMs": "3800", "endTimeMs": "4140" },
                  { "text": "в", "startTimeMs": "4200", "endTimeMs": "4220" },
                  { "text": "речь", "startTimeMs": "4279", "endTimeMs": "4740" },
                  { "text": "теперь", "startTimeMs": "5140", "endTimeMs": "5759" },
                  { "text": "и", "startTimeMs": "5859", "endTimeMs": "5900" },
                  { "text": "вы", "startTimeMs": "5980", "endTimeMs": "6399" },
                  { "text": "можете", "startTimeMs": "6660", "endTimeMs": "7180" }
                ],
                "text": "Я яндекс спичкит я могу превратить любой текст в речь теперь и вы можете",
                "startTimeMs": "0",
                "endTimeMs": "7400",
                "confidence": 0,
                "languages": []
              }
            ],
            "channelTag": "0"
          }
        },
        "channelTag": "0"
      }
    }
    {
      "result": {
        "sessionUuid": {
          "uuid": "24935f24-2c1f62dc-8dd49006-********",
          "userRequestId": "f8d2h7m07t4i********"
        },
        "audioCursors": {
          "receivedDataMs": "7400",
          "resetTimeMs": "0",
          "partialTimeMs": "7400",
          "finalTimeMs": "7400",
          "finalIndex": "0",
          "eouTimeMs": "7400"
        },
        "responseWallTimeMs": "190",
        "eouUpdate": {
          "timeMs": "7400"
        },
        "channelTag": "0"
      }
    }
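The result in the example is a sequence of JSON objects, and the normalized text sits in the object that carries finalRefinement. A possible way to pull out only that text is sketched below. It assumes the jq utility is installed (it is not mentioned in this guide), that the response keeps the field layout shown in the example, and the function name normalized_text is hypothetical.

```shell
# normalized_text
# Reads a getRecognition response (one or more JSON objects) from stdin and
# prints the text of every alternative found under finalRefinement.normalizedText.
# Objects without finalRefinement (final, eouUpdate) produce no output.
# Assumption: jq is installed and the field layout matches the result example.
normalized_text() {
  jq -r '.result.finalRefinement.normalizedText.alternatives[]?.text'
}

# Usage (performs a real network call, so it is left commented out):
#   curl --header "Authorization: Api-Key ${API_KEY}" \
#     --header "x-folder-id: ${FOLDER_ID}" \
#     "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=${OPERATION_ID}" \
#     | normalized_text
```

Here OPERATION_ID is a hypothetical variable holding the recognition operation id saved earlier.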
API v2

- Get a link to an audio file in Object Storage.
- Create a file named body.json and add the following code to it:

    {
      "config": {
        "specification": {
          "languageCode": "ru-RU",
          "model": "general",
          "audioEncoding": "LINEAR16_PCM",
          "sampleRateHertz": 8000,
          "audioChannelCount": 1
        }
      },
      "audio": {
        "uri": "<link_to_audio_file>"
      }
    }

  Where:
  - languageCode: Recognition language.
  - model: Speech recognition model.
  - audioEncoding: Format of the submitted audio file.
  - sampleRateHertz: Audio file sampling rate in Hz.
  - audioChannelCount: Number of audio channels.
  - uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.pcm.
    For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to provide these parameters in SpeechKit as they are ignored.
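When the sampling rate or channel count differs between files, the body can be generated instead of edited by hand. This is a sketch only: make_v2_body is a hypothetical name, the defaults copy the values from the body.json example, and the link is assumed to need no JSON escaping.

```shell
# make_v2_body URI [SAMPLE_RATE_HZ] [CHANNEL_COUNT]
# Prints a body for an LPCM file; defaults match the body.json example.
# Assumption: URI contains no characters that need JSON escaping.
make_v2_body() {
  cat <<EOF
{
  "config": {
    "specification": {
      "languageCode": "ru-RU",
      "model": "general",
      "audioEncoding": "LINEAR16_PCM",
      "sampleRateHertz": ${2:-8000},
      "audioChannelCount": ${3:-1}
    }
  },
  "audio": {
    "uri": "$1"
  }
}
EOF
}
```

For example, `make_v2_body https://storage.yandexcloud.net/speechkit/speech.pcm > body.json` reproduces the example body with the sample link.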
- Send the recognition request with the file you created:

    export API_KEY=<service_account_API_key> && \
    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      --data "@body.json" \
      https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize

  Result example:

    {
      "done": false,
      "id": "e03sup6d5h1q********",
      "createdAt": "2019-04-21T22:49:29Z",
      "createdBy": "ajes08feato8********",
      "modifiedAt": "2019-04-21T22:49:29Z"
    }

  Save the recognition operation id you get in the response.
- Wait until the recognition is complete. It takes about 10 seconds to recognize one minute of single-channel audio.
- Send a request to get information about the operation:

    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

  Result example:

    {
      "done": true,
      "response": {
        "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
        "chunks": [
          {
            "alternatives": [
              {
                "words": [
                  { "startTime": "0.160s", "endTime": "0.500s", "word": "hello", "confidence": 1 },
                  { "startTime": "0.580s", "endTime": "0.800s", "word": "world", "confidence": 1 }
                ],
                "text": "Hello world",
                "confidence": 1
              }
            ],
            "channelTag": "1"
          }
        ]
      },
      "id": "e03jjenu23uc********",
      "createdAt": "2024-08-22T11:39:22Z",
      "createdBy": "aje3bg430agh********",
      "modifiedAt": "2024-08-22T11:39:23Z"
    }

  If speech recognition in the provided file fails, the response.chunks section may be missing from the response.
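Because response.chunks may be absent, a script that reads the transcript should treat the missing section as a failure rather than as empty text. A minimal sketch follows. It assumes the jq utility is installed (it is not mentioned in this guide) and that the response matches the result example; the function name v2_transcript is hypothetical.

```shell
# v2_transcript
# Reads an operation response from stdin and prints the text of the first
# alternative of every chunk, one chunk per line.
# Returns 1 if the response has no chunks.
# Assumption: jq is installed and the layout matches the result example.
v2_transcript() {
  transcript=$(jq -r '.response.chunks[]?.alternatives[0].text')
  if [ -z "$transcript" ]; then
    echo "no chunks in response" >&2
    return 1
  fi
  printf '%s\n' "$transcript"
}

# Usage (performs a real network call, so it is left commented out):
#   curl --header "Authorization: Api-Key ${API_KEY}" \
#     "https://operation.api.cloud.yandex.net/operations/${OPERATION_ID}" | v2_transcript
```

Here OPERATION_ID is a hypothetical variable holding the recognition operation id saved earlier.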