

SpeechKit Recognition API v3, REST: AsyncRecognizer.RecognizeFile

Written by Yandex Cloud. Updated at October 30, 2025.
  • HTTP request
  • Body parameters
  • RecognitionModelOptions
  • AudioFormatOptions
  • RawAudio
  • ContainerAudio
  • TextNormalizationOptions
  • LanguageRestrictionOptions
  • RecognitionClassifierOptions
  • RecognitionClassifier
  • SpeechAnalysisOptions
  • SpeakerLabelingOptions
  • SummarizationOptions
  • SummarizationProperty
  • JsonSchema
  • Response
  • Status

Performs asynchronous speech recognition.

HTTP request

POST https://stt.api.cloud.yandex.net/stt/v3/recognizeFileAsync

Body parameters

{
  // Includes only one of the fields `content`, `uri`
  "content": "string",
  "uri": "string",
  // end of the list of possible fields
  "recognitionModel": {
    "model": "string",
    "audioFormat": {
      // Includes only one of the fields `rawAudio`, `containerAudio`
      "rawAudio": {
        "audioEncoding": "string",
        "sampleRateHertz": "string",
        "audioChannelCount": "string"
      },
      "containerAudio": {
        "containerAudioType": "string"
      }
      // end of the list of possible fields
    },
    "textNormalization": {
      "textNormalization": "string",
      "profanityFilter": "boolean",
      "literatureText": "boolean",
      "phoneFormattingMode": "string"
    },
    "languageRestriction": {
      "restrictionType": "string",
      "languageCode": [
        "string"
      ]
    },
    "audioProcessingType": "string"
  },
  "recognitionClassifier": {
    "classifiers": [
      {
        "classifier": "string",
        "triggers": [
          "string"
        ]
      }
    ]
  },
  "speechAnalysis": {
    "enableSpeakerAnalysis": "boolean",
    "enableConversationAnalysis": "boolean",
    "descriptiveStatisticsQuantiles": [
      "string"
    ]
  },
  "speakerLabeling": {
    "speakerLabeling": "string"
  },
  "summarization": {
    "modelUri": "string",
    "properties": [
      {
        "instruction": "string",
        // Includes only one of the fields `jsonObject`, `jsonSchema`
        "jsonObject": "boolean",
        "jsonSchema": {
          "schema": "object"
        }
        // end of the list of possible fields
      }
    ]
  }
}
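The schema above can be assembled in a few lines of code. The sketch below is illustrative, not an official client: the storage URL is a placeholder, and the choice of model and container type are assumptions. It shows the `content`/`uri` oneof (pass audio bytes inline, or point at an object in S3-compatible storage) by setting only `uri`.

```python
import json

def build_request_body(uri: str, model: str = "general") -> dict:
    """Build a minimal body for POST /stt/v3/recognizeFileAsync.

    `content` and `uri` are mutually exclusive; this sketch uses `uri`.
    """
    return {
        "uri": uri,  # e.g. a presigned Object Storage URL (placeholder here)
        "recognitionModel": {
            "model": model,
            "audioFormat": {
                "containerAudio": {"containerAudioType": "OGG_OPUS"}
            },
        },
    }

body = build_request_body("https://storage.example.com/bucket/call.ogg")
print(json.dumps(body, indent=2))
# Sending it requires authentication, e.g.:
#   POST https://stt.api.cloud.yandex.net/stt/v3/recognizeFileAsync
#   Authorization: Bearer <IAM_TOKEN>
```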

| Field | Description |
|---|---|
| content | string (bytes). Bytes with the audio data. Includes only one of the fields content, uri. |
| uri | string. S3 data URL. Includes only one of the fields content, uri. |
| recognitionModel | RecognitionModelOptions. Configuration for the speech recognition model. |
| recognitionClassifier | RecognitionClassifierOptions. Configuration for classifiers over speech recognition. |
| speechAnalysis | SpeechAnalysisOptions. Configuration for speech analysis over speech recognition. |
| speakerLabeling | SpeakerLabelingOptions. Configuration for speaker labeling. |
| summarization | SummarizationOptions. Summarization options. |

RecognitionModelOptions

| Field | Description |
|---|---|
| model | string. Sets the recognition model for the cloud version of SpeechKit. For Recognizer.RecognizeStreaming, possible values are general, general:rc, and general:deprecated. For AsyncRecognizer.RecognizeFile, possible values are general, general:rc, general:deprecated, deferred-general, deferred-general:rc, and deferred-general:deprecated. The model is ignored for SpeechKit Hybrid. |
| audioFormat | AudioFormatOptions. Format of the input audio. |
| textNormalization | TextNormalizationOptions. Text normalization options. |
| languageRestriction | LanguageRestrictionOptions. Possible languages in the audio. |
| audioProcessingType | enum (AudioProcessingType). For Recognizer.RecognizeStreaming, defines the audio data processing mode; the default is REAL_TIME. For AsyncRecognizer.RecognizeFile, this field is ignored. Values: AUDIO_PROCESSING_TYPE_UNSPECIFIED; REAL_TIME (process audio in a mode optimized for real-time recognition, i.e., send partial and final responses as soon as possible); FULL_DATA (process audio after all data has been received). |

AudioFormatOptions

Audio format options.

| Field | Description |
|---|---|
| rawAudio | RawAudio. Raw audio without a container. Includes only one of the fields rawAudio, containerAudio. |
| containerAudio | ContainerAudio. Audio wrapped in a container. Includes only one of the fields rawAudio, containerAudio. |

RawAudio

Raw audio format specification (there is no container to infer the type from). Used in AudioFormat options.

| Field | Description |
|---|---|
| audioEncoding | enum (AudioEncoding). Type of audio encoding. Values: AUDIO_ENCODING_UNSPECIFIED; LINEAR16_PCM (16-bit signed little-endian samples, Linear PCM). |
| sampleRateHertz | string (int64). PCM sample rate. |
| audioChannelCount | string (int64). PCM channel count. Currently, only single-channel audio is supported in real-time recognition. |

ContainerAudio

Audio with a fixed type in a container. Used in AudioFormat options.

| Field | Description |
|---|---|
| containerAudioType | enum (ContainerAudioType). Type of audio container. Values: CONTAINER_AUDIO_TYPE_UNSPECIFIED; WAV (16-bit signed little-endian samples, Linear PCM); OGG_OPUS (data encoded with the Opus audio codec in an OGG container); MP3 (data encoded as MPEG-1/2 Layer III in an MP3 container). |

TextNormalizationOptions

Options for post-processing text results. The normalization levels depend on the settings and the language. For detailed information, see the documentation.

| Field | Description |
|---|---|
| textNormalization | enum (TextNormalization). Values: TEXT_NORMALIZATION_UNSPECIFIED; TEXT_NORMALIZATION_ENABLED (convert numbers, dates, and times from text to numeric format); TEXT_NORMALIZATION_DISABLED (disable all normalization; default value). |
| profanityFilter | boolean. Profanity filter (default: false). |
| literatureText | boolean. Rewrite text in a literary style (default: false). |
| phoneFormattingMode | enum (PhoneFormattingMode). Defines the phone formatting mode. Values: PHONE_FORMATTING_MODE_UNSPECIFIED; PHONE_FORMATTING_MODE_DISABLED (disable phone formatting). |

LanguageRestrictionOptions

Type of restriction for the list of languages expected in the incoming audio.

| Field | Description |
|---|---|
| restrictionType | enum (LanguageRestrictionType). Language restriction type. The model treats these restrictions as guidelines, not as strict rules. The language is recognized for each sentence; if a sentence contains phrases in different languages, all of them are transcribed in the most probable language. Values: LANGUAGE_RESTRICTION_TYPE_UNSPECIFIED; WHITELIST (the list of the most likely languages in the incoming audio); BLACKLIST (the list of languages unlikely to appear in the incoming audio). |
| languageCode[] | string. The list of language codes to restrict recognition to when using an auto model. |

RecognitionClassifierOptions

| Field | Description |
|---|---|
| classifiers[] | RecognitionClassifier. List of classifiers to use. For detailed information and a usage example, see the documentation. |

RecognitionClassifier

| Field | Description |
|---|---|
| classifier | string. Classifier name. |
| triggers[] | enum (TriggerType). The types of responses that classification results accompany; classification responses follow the responses of the specified types. Values: TRIGGER_TYPE_UNSPECIFIED; ON_UTTERANCE (apply the classifier to utterance responses); ON_FINAL (apply the classifier to final responses); ON_PARTIAL (apply the classifier to partial responses). |

SpeechAnalysisOptions

| Field | Description |
|---|---|
| enableSpeakerAnalysis | boolean. Analyze speech for each speaker. |
| enableConversationAnalysis | boolean. Analyze a conversation between two speakers. |
| descriptiveStatisticsQuantiles[] | string. Quantile levels in the range (0, 1) for descriptive statistics. |
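Since the quantile levels must lie strictly between 0 and 1, it is easy to validate them client-side before sending the request. A minimal sketch (the helper name is my own, not an API symbol):

```python
def speech_analysis_options(quantiles: list[float]) -> dict:
    """Build SpeechAnalysisOptions; quantile levels must be in the open interval (0, 1)."""
    if any(not 0.0 < q < 1.0 for q in quantiles):
        raise ValueError("quantile levels must be in the open interval (0, 1)")
    return {
        "enableSpeakerAnalysis": True,
        "enableConversationAnalysis": True,
        # quantiles are serialized as strings in the REST JSON mapping
        "descriptiveStatisticsQuantiles": [str(q) for q in quantiles],
    }

opts = speech_analysis_options([0.5, 0.9, 0.99])
```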

SpeakerLabelingOptions

| Field | Description |
|---|---|
| speakerLabeling | enum (SpeakerLabeling). Specifies whether speaker labeling is performed. Values: SPEAKER_LABELING_UNSPECIFIED; SPEAKER_LABELING_ENABLED (enable speaker labeling); SPEAKER_LABELING_DISABLED (disable speaker labeling; default value). |

SummarizationOptions

Represents transcription summarization options.

| Field | Description |
|---|---|
| modelUri | string. The ID of the model to be used for completion generation. |
| properties[] | SummarizationProperty. A list of summarizations to perform on the transcription. |

SummarizationProperty

Represents a summarization entry for the transcription.

| Field | Description |
|---|---|
| instruction | string. Summarization instruction for the model. |
| jsonObject | boolean. When set to true, the model returns a valid JSON object. Be sure to ask the model explicitly for JSON; otherwise, it may produce excessive whitespace and run indefinitely until it reaches the token limit. Specifies the format of the model's response. Includes only one of the fields jsonObject, jsonSchema. |
| jsonSchema | JsonSchema. Enforces a specific JSON structure for the model's response based on a provided schema. Specifies the format of the model's response. Includes only one of the fields jsonObject, jsonSchema. |

JsonSchema

Represents the expected structure of the model's response using a JSON Schema.

| Field | Description |
|---|---|
| schema | object. The JSON Schema that the model's output must conform to. |

Response

HTTP Code: 200 - OK

{
  "id": "string",
  "description": "string",
  "createdAt": "string",
  "createdBy": "string",
  "modifiedAt": "string",
  "done": "boolean",
  "metadata": "object",
  // Includes only one of the fields `error`
  "error": {
    "code": "integer",
    "message": "string",
    "details": [
      "object"
    ]
  }
  // end of the list of possible fields
}

An Operation resource. For more information, see Operation.

| Field | Description |
|---|---|
| id | string. ID of the operation. |
| description | string. Description of the operation. 0-256 characters long. |
| createdAt | string (date-time). Creation timestamp. A string in RFC 3339 text format; the range of possible values is from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z, i.e., 0 to 9 digits for fractions of a second. To work with values in this field, use the APIs described in the Protocol Buffers reference. In some languages, built-in datetime utilities do not support nanosecond precision (9 digits). |
| createdBy | string. ID of the user or service account that initiated the operation. |
| modifiedAt | string (date-time). The time when the Operation resource was last modified. A string in RFC 3339 text format; the range of possible values is from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z, i.e., 0 to 9 digits for fractions of a second. To work with values in this field, use the APIs described in the Protocol Buffers reference. In some languages, built-in datetime utilities do not support nanosecond precision (9 digits). |
| done | boolean. If false, the operation is still in progress. If true, the operation is complete, and either error or response is available. |
| metadata | object. Service-specific metadata associated with the operation. It typically contains the ID of the target resource that the operation is performed on. Any method that returns a long-running operation should document the metadata type, if any. |
| error | Status. The error result of the operation in case of failure or cancellation. Includes only one of the fields error. The operation result: if done == false and no failure was detected, neither error nor response is set; if done == false and a failure was detected, error is set; if done == true, exactly one of error or response is set. |
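The done/error contract above determines how a client should interpret each Operation it fetches. A minimal classifier, as a sketch (a real poller would re-fetch the operation, e.g. via AsyncRecognizer.GetRecognition, until the outcome is no longer in progress; the endpoint details are outside this sketch):

```python
def operation_outcome(op: dict) -> str:
    """Classify an Operation resource per the done/error contract:
    error may appear even before done is true; done without error means success."""
    if "error" in op:
        return "failed"
    return "succeeded" if op.get("done") else "in_progress"

# Example: a freshly created operation is still in progress.
status = operation_outcome({"id": "abc123", "done": False})
```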

Status

The error result of the operation in case of failure or cancellation.

| Field | Description |
|---|---|
| code | integer (int32). Error code. An enum value of google.rpc.Code. |
| message | string. An error message. |
| details[] | object. A list of messages that carry the error details. |

© 2025 Direct Cursus Technology L.L.C.