
In this article:

  • Getting started
  • Send a request to the model

Sending an asynchronous request

Written by
Yandex Cloud
Updated at April 11, 2025

You can send requests to text generation models in asynchronous mode. In response to an asynchronous request, the model returns an operation object containing the operation ID, which you can use to track the request's progress and retrieve the result once generation is complete. Use this mode if you do not need an immediate response: asynchronous requests take longer to complete than synchronous ones.
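The flow just described (submit, receive an operation ID, poll, fetch the result) can be sketched generically. In this sketch, `submit`, `get_status`, and `get_result` are hypothetical stand-ins for the real API calls shown later in this article; only the polling pattern itself is the point.

```python
import time


def run_deferred_and_wait(submit, get_status, get_result, interval=5.0, sleep=time.sleep):
    """Generic asynchronous-request lifecycle:
    submit the request, poll the returned operation
    until it completes, then fetch the result.

    submit() -> operation ID
    get_status(op_id) -> True once the operation is done
    get_result(op_id) -> the generated result
    interval is the polling period in seconds.
    """
    operation_id = submit()              # the model returns an operation ID
    while not get_status(operation_id):  # poll until the operation is done
        sleep(interval)
    return get_result(operation_id)
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; in practice the default `time.sleep` is what you want.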

Getting started

SDK
cURL

To use the SDK request examples:

  1. Create a service account and assign the ai.languageModels.user role to it.

  2. Get the service account API key and save it.

    The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.

  3. Use the pip package manager to install the ML SDK library:

    pip install yandex-cloud-ml-sdk
    

Get API authentication credentials as described in Authentication with the Yandex Foundation Models API.

To use the examples, install cURL.

Send a request to the model

SDK
cURL

When using Yandex Cloud ML SDK, you can configure your code to wait for the operation to complete and return the response. To do this, use either the sleep function from the time module or the operation's wait method. The example below demonstrates both approaches, one after the other.

  1. Create a file named generate-deferred.py and paste the following code into it:

    #!/usr/bin/env python3
    
    from __future__ import annotations
    import time
    from yandex_cloud_ml_sdk import YCloudML
    
    messages_1 = [
        {
            "role": "system",
            "text": "Find errors in the text and correct them",
        },
        {
            "role": "user",
            "text": """Laminate flooring is sutiable for instalation in the kitchen or in a child's 
    room. It withsatnds moisturre and mechanical dammage thanks to 
    a 0.2 mm thick proctive layer of melamine films and 
    a wax-treated interlocking system.""",
        },
    ]
    
    messages_2 = [
        {"role": "system", "text": "Find errors in the text and correct them"},
        {"role": "user", "text": "Erors wyll not corrct themselfs."},
    ]
    
    
    def main():
    
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )
    
        model = sdk.models.completions("yandexgpt")
    
        # Variant 1: wait for the operation to complete using 5-second sleep periods
    
        print("Variant 1:")
    
        operation = model.configure(temperature=0.5).run_deferred(messages_1)
    
        status = operation.get_status()
        while status.is_running:
            time.sleep(5)
            status = operation.get_status()
    
        result = operation.get_result()
        print(result)
    
        # Variant 2: wait for the operation to complete using the wait method
    
        print("Variant 2:")
    
        operation = model.run_deferred(messages_2)
    
        result = operation.wait()
        print(result)
    
    
    if __name__ == "__main__":
        main()
    

    Where:

    Note

    As input data for a request, Yandex Cloud ML SDK can accept a string, a dictionary, an object of the TextMessage class, or an array containing any combination of these data types. For more information, see Yandex Cloud ML SDK usage.

    • messages_1 and messages_2: Arrays of messages providing the context for the model, each used for a different method of getting an asynchronous request result:

      • role: Message sender's role:

        • user: Used for sending user messages to the model.
        • system: Used to set the query context and define the model's behavior.
        • assistant: Used for responses generated by the model. In chat mode, the model's responses tagged with the assistant role are included in the message to save the conversation context. Do not send user messages with this role.
    • text: Message text.

    • <folder_ID>: ID of the folder in which the service account was created.

    • <API_key>: Service account API key you got earlier, required for authentication in the API.


    For more information about accessing a specific model version, see Accessing models.
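The note above says the SDK accepts a string, a dictionary, a TextMessage object, or an array mixing these. As an illustration only (this helper and its name are made up; the real SDK performs this normalization internally), the accepted forms all reduce to the same list of role/text messages:

```python
def to_messages(payload):
    """Hypothetical sketch of how mixed input forms map to the
    {'role': ..., 'text': ...} message list the model ultimately sees.

    A bare string becomes a user message, a dict is taken as-is,
    and a list may mix both forms.
    """
    if isinstance(payload, str):
        return [{"role": "user", "text": payload}]
    if isinstance(payload, dict):
        return [payload]
    # A list: normalize each element the same way.
    return [
        item if isinstance(item, dict) else {"role": "user", "text": item}
        for item in payload
    ]
```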

  2. Run the created file:

    python3 generate-deferred.py
    

    Result:

    Variant 1:
    GPTModelResult(alternatives=(Alternative(role='assistant', text="Laminate flooring is suitable for installation in the kitchen or in a child's room. It withstands moisture and mechanical damage thanks to a 0.2 mm thick protective layer of melamine films and a wax-treated interlocking system.", status=<AlternativeStatus.FINAL: 3>),), usage=Usage(input_text_tokens=74, completion_tokens=46, total_tokens=120), model_version='23.10.2024')
    Variant 2:
    GPTModelResult(alternatives=(Alternative(role='assistant', text='Errors will not correct themselves.\n\nErors → errors.', status=<AlternativeStatus.FINAL: 3>),), usage=Usage(input_text_tokens=32, completion_tokens=16, total_tokens=48), model_version='23.10.2024')
    

    The code waits for the first variant's result and then for the second's.

To use the examples, install cURL.

The example below is intended to be run on macOS and Linux. To run it on Windows, see how to work with Bash in Microsoft Windows.

  1. Create a file with the request body, e.g., body.json:

    {
      "modelUri": "gpt://<folder_ID>/yandexgpt",
      "completionOptions": {
        "stream": false,
        "temperature": 0.1,
        "maxTokens": "2000",
        "reasoningOptions": {
          "mode": "DISABLED"
        }
      },
      "messages": [
        {
          "role": "system",
          "text": "Translate the text"
        },
        {
          "role": "user",
          "text": "To be, or not to be: that is the question."
        }
      ]
    }
    
    • modelUri: ID of the model that will be used to generate the response. The parameter contains the Yandex Cloud folder ID or the tuned model's ID.

    • completionOptions: Request configuration options:

      • stream: Enables streaming of partially generated text. It can either be true or false.

      • temperature: With a higher temperature, you get more creative and randomized responses from the model. Its values range from 0 to 1, inclusive. The default value is 0.3.

      • maxTokens: Sets a limit on the model's output in tokens. The maximum number of tokens per generation depends on the model. For more information, see Quotas and limits in Yandex Foundation Models.

      • reasoningOptions.mode: Reasoning mode parameters. This is an optional parameter. The default value is DISABLED. The possible values are:

        • DISABLED: Reasoning mode is disabled.
        • ENABLED_HIDDEN: Reasoning mode is enabled. The model will decide by itself whether or not to use this mode for each particular request.
    • messages: List of messages that set the context for the model:

      • role: Message sender's role:

        • user: Used for sending user messages to the model.
        • system: Used to set the query context and define the model's behavior.
        • assistant: Used for responses generated by the model. In chat mode, the model's responses tagged with the assistant role are included in the message to save the conversation context. Do not send user messages with this role.
      • text: Message text.
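If you prefer to generate body.json from code rather than write it by hand, the same request body can be built and saved with a short Python sketch (the `<folder_ID>` placeholder stays a placeholder; substitute your real folder ID):

```python
import json

FOLDER_ID = "<folder_ID>"  # substitute your real folder ID

# Same request body as the body.json example above, built programmatically.
body = {
    "modelUri": f"gpt://{FOLDER_ID}/yandexgpt",
    "completionOptions": {
        "stream": False,
        "temperature": 0.1,
        "maxTokens": "2000",
        "reasoningOptions": {"mode": "DISABLED"},
    },
    "messages": [
        {"role": "system", "text": "Translate the text"},
        {"role": "user", "text": "To be, or not to be: that is the question."},
    ],
}

with open("body.json", "w") as f:
    json.dump(body, f, ensure_ascii=False, indent=2)
```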

  2. Send a request to the model by running this command:

    export FOLDER_ID=<folder_ID>
    export IAM_TOKEN=<IAM_token>
    curl \
      --request POST \
      --header "Content-Type: application/json" \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      --data "@<path_to_JSON_file>" \
      "https://llm.api.cloud.yandex.net/foundationModels/v1/completionAsync"
    

    Where:

    • FOLDER_ID: ID of the folder for which your account has the ai.languageModels.user role or higher.
    • IAM_TOKEN: IAM token you got before you started.

    In the response, the service will return the operation object:

    {
      "id": "d7qi6shlbvo5********",
      "description": "Async GPT Completion",
      "createdAt": "2023-11-30T18:31:32Z",
      "createdBy": "aje2stn6id9k********",
      "modifiedAt": "2023-11-30T18:31:33Z",
      "done": false,
      "metadata": null
    }
    

    Save the operation ID you get in the response.

  3. Send a request to get the operation result:

    curl \
      --request GET \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      https://operation.api.cloud.yandex.net/operations/<operation_ID>
    

    Result example:

    {
      "done": true,
      "response": {
        "@type": "type.googleapis.com/yandex.cloud.ai.foundation_models.v1.CompletionResponse",
        "alternatives": [
          {
            "message": {
              "role": "assistant",
              "text": "To be, or not to be, that is the question."
            },
            "status": "ALTERNATIVE_STATUS_FINAL"
          }
        ],
        "usage": {
          "inputTextTokens": "31",
          "completionTokens": "10",
          "totalTokens": "41"
        },
        "modelVersion": "18.01.2024"
      },
      "id": "d7qo21o5fj1u********",
      "description": "Async GPT Completion",
      "createdAt": "2024-05-12T18:46:54Z",
      "createdBy": "ajes08feato8********",
      "modifiedAt": "2024-05-12T18:46:55Z"
    }
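
Once `done` is `true`, the generated text and token usage can be pulled out of the operation response shown above. A minimal sketch, assuming the response has the structure of the example:

```python
import json


def extract_completion(operation_json: str):
    """Extract the generated text and total token count from a
    completed operation response."""
    op = json.loads(operation_json)
    if not op.get("done"):
        raise RuntimeError("operation is still running")  # keep polling
    response = op["response"]
    text = response["alternatives"][0]["message"]["text"]
    total_tokens = int(response["usage"]["totalTokens"])
    return text, total_tokens
```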
    

See also

  • Text generation overview
  • Examples of working with ML SDK on GitHub

Yandex project
© 2025 Yandex.Cloud LLC