Sending a request in prompt mode

Written by Yandex Cloud
Updated on April 11, 2025
  • Getting started
  • Request to a model via the REST API
  • Request to a model via the gRPC API
    • Streaming request via the gRPC API

To generate text in prompt mode, send a request to the model using the completion API method or the Yandex Cloud ML SDK.

Getting started

SDK

To run the request examples using the SDK:

  1. Create a service account and assign the ai.languageModels.user role to it.

  2. Get the service account API key and save it.

    The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication; a short sketch follows these steps. For more information, see Authentication in Yandex Cloud ML SDK.

  3. Use the pip package manager to install the ML SDK library:

    pip install yandex-cloud-ml-sdk
    
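The steps above authenticate with an API key. As noted in step 2, the SDK also accepts an IAM token. A minimal sketch under that assumption (the environment variable names here are ours):

    import os

    from yandex_cloud_ml_sdk import YCloudML

    # Assumption: an IAM token string can be passed through the same auth
    # parameter used for API keys in the examples below; IAM_TOKEN and
    # FOLDER_ID are hypothetical environment variable names.
    sdk = YCloudML(
        folder_id=os.environ["FOLDER_ID"],
        auth=os.environ["IAM_TOKEN"],
    )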

cURL

Get API authentication credentials as described in Authentication with the Yandex Foundation Models API.

To use the examples, install cURL.

Python

Get API authentication credentials as described in Authentication with the Yandex Foundation Models API.

Request to a model via the REST API

cURL

The example below is intended to run on macOS and Linux. To run it on Windows, see how to work with Bash in Microsoft Windows.

  1. Create a file with the request body, e.g., body.json:

    {
      "modelUri": "gpt://<folder_ID>/yandexgpt",
      "completionOptions": {
        "stream": false,
        "temperature": 0.1,
        "maxTokens": "1000",
        "reasoningOptions": {
          "mode": "DISABLED"
        }
      },
      "messages": [
        {
          "role": "system",
          "text": "Translate the text"
        },
        {
          "role": "user",
          "text": "To be, or not to be: that is the question."
        }
      ]
    }
    

    Where:

    • modelUri: ID of the model that will be used to generate the response. The parameter contains the Yandex Cloud folder ID or the tuned model's ID.

    • completionOptions: Request configuration options:

      • stream: Enables streaming of partially generated text. It can either be true or false.

      • temperature: Higher values produce more creative and randomized responses from the model. Valid values range from 0 to 1, inclusive. The default value is 0.3.

      • maxTokens: Sets a limit on the model's output in tokens. The maximum number of tokens per generation depends on the model. For more information, see Quotas and limits in Yandex Foundation Models.

      • reasoningOptions.mode: Reasoning mode parameters. This is an optional parameter. The default value is DISABLED. The possible values are:

        • DISABLED: Reasoning mode is disabled.
        • ENABLED_HIDDEN: Reasoning mode is enabled. The model will decide by itself whether or not to use this mode for each particular request.
    • messages: List of messages that set the context for the model:

      • role: Message sender's role:

        • user: Used for sending user messages to the model.
        • system: Used to set the query context and define the model's behavior.
        • assistant: Used for responses generated by the model. In chat mode, the model's responses tagged with the assistant role are included in the message to save the conversation context. Do not send user messages with this role.
      • text: Message text.

  2. Send a request to the model by running this command:

    export FOLDER_ID=<folder_ID>
    export IAM_TOKEN=<IAM_token>
    curl \
      --request POST \
      --header "Content-Type: application/json" \
      --header "Authorization: Bearer ${IAM_TOKEN}" \
      --header "x-folder-id: ${FOLDER_ID}" \
      --data "@<path_to_JSON_file>" \
      "https://llm.api.cloud.yandex.net/foundationModels/v1/completion"
    

    Where:

    • FOLDER_ID: ID of the folder for which your account has the ai.languageModels.user role or higher.
    • IAM_TOKEN: IAM token you got before you started.
    Result:
    {
      "result": {
        "alternatives": [
          {
            "message": {
              "role": "assistant",
              "text": "To be, or not to be, that is the question."
            },
            "status": "ALTERNATIVE_STATUS_FINAL"
          }
        ],
        "usage": {
          "inputTextTokens": "28",
          "completionTokens": "10",
          "totalTokens": "38"
        },
        "modelVersion": "06.12.2023"
      }
    }
    
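The response structure shown above also lends itself to scripted post-processing. A minimal sketch, assuming you saved the response body to a file (the response.json name is ours):

    import json

    # Assumption: the curl output above was saved with `-o response.json`.
    with open("response.json") as f:
        response = json.load(f)

    # The generated text sits at result.alternatives[0].message.text,
    # as in the sample response above.
    print(response["result"]["alternatives"][0]["message"]["text"])
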
Python

  1. Create a file named test.py with the model request code:

    import requests
    import argparse
    
    URL = "https://llm.api.cloud.yandex.net/foundationModels/v1/completion"
    
    def run(iam_token, folder_id, user_text):    
        # Building a request
        data = {}
        # Specifying model type
        data["modelUri"] = f"gpt://{folder_id}/yandexgpt"
        # Configuring options
        data["completionOptions"] = {"temperature": 0.3, "maxTokens": 1000}
        # Specifying context for the model
        data["messages"] = [
            {"role": "system", "text": "Correct errors in the text."},
            {"role": "user", "text": f"{user_text}"},
        ]
        
        # Sending the request
        response = requests.post(
            URL,
            headers={
                "Accept": "application/json",
                "Authorization": f"Bearer {iam_token}"
            },
            json=data,
        ).json()
    
        # Printing the result
        print(response)
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("--iam_token", required=True, help="IAM token")
        parser.add_argument("--folder_id", required=True, help="Folder id")
        parser.add_argument("--user_text", required=True, help="User text")
        args = parser.parse_args()
        run(args.iam_token, args.folder_id, args.user_text)
    
  2. Run the test.py file, providing the IAM token and folder ID values:

    export IAM_TOKEN=<IAM_token>
    export FOLDER_ID=<folder_ID>
    export TEXT='Erors wont corrct themselfs'
    python test.py \
      --iam_token ${IAM_TOKEN} \
      --folder_id ${FOLDER_ID} \
      --user_text "${TEXT}"
    
    Result:
    {'result': {'alternatives': [{'message': {'role': 'assistant', 'text': 'Errors will not correct themselves.'}, 'status': 'ALTERNATIVE_STATUS_FINAL'}], 'usage': {'inputTextTokens': '29', 'completionTokens': '9', 'totalTokens': '38'}, 'modelVersion': '07.03.2024'}}
    
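The script above prints the raw response and ignores HTTP errors. A hedged hardening sketch of the same request with basic error handling (the complete helper name is ours):

    import os

    import requests

    URL = "https://llm.api.cloud.yandex.net/foundationModels/v1/completion"

    def complete(iam_token: str, folder_id: str, user_text: str) -> str:
        data = {
            "modelUri": f"gpt://{folder_id}/yandexgpt",
            "completionOptions": {"temperature": 0.3, "maxTokens": 1000},
            "messages": [
                {"role": "system", "text": "Correct errors in the text."},
                {"role": "user", "text": user_text},
            ],
        }
        resp = requests.post(
            URL,
            headers={"Authorization": f"Bearer {iam_token}"},
            json=data,
            timeout=60,
        )
        # Fail fast on 4xx/5xx instead of printing an error body as a result.
        resp.raise_for_status()
        return resp.json()["result"]["alternatives"][0]["message"]["text"]

    print(complete(os.environ["IAM_TOKEN"], os.environ["FOLDER_ID"], "Erors wont corrct themselfs"))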

Request to a model via the gRPC API

SDK
  1. Create a file named generate-text.py and paste the following code into it:

    #!/usr/bin/env python3
    
    from __future__ import annotations
    from yandex_cloud_ml_sdk import YCloudML
    
    messages = [
        {
            "role": "system",
            "text": "Find errors in the text and correct them",
        },
        {
            "role": "user",
            "text": """Laminate flooring is sutiable for instalation in the kitchen or in a child's 
    room. It withsatnds moisturre and mechanical dammage thanks to 
    a 0.2 mm thick proctive layer of melamine films and 
    a wax-treated interlocking system.""",
        },
    ]
    
    
    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )
    
        result = (
            sdk.models.completions("yandexgpt").configure(temperature=0.5).run(messages)
        )
    
        for alternative in result:
            print(alternative)
    
    
    if __name__ == "__main__":
        main()
    

    Where:

    Note

    As input data for a request, Yandex Cloud ML SDK can accept a string, a dictionary, an object of the TextMessage class, or an array containing any combination of these data types; a plain-string sketch follows the result below. For more information, see Yandex Cloud ML SDK usage.

    • messages: List of messages that set the context for the model:

      • role: Message sender's role:

        • user: Used for sending user messages to the model.
        • system: Used to set the query context and define the model's behavior.
        • assistant: Used for responses generated by the model. In chat mode, the model's responses tagged with the assistant role are included in the message to save the conversation context. Do not send user messages with this role.
      • text: Message text.

    • <folder_ID>: ID of the folder in which the service account was created.

    • <API_key>: Service account API key you got earlier; it is required for authentication in the API.

      The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.

    For more information about accessing a specific model version, see Accessing models.

  2. Run the created file:

    python3 generate-text.py
    

    Result:

    Alternative(role='assistant', text="Laminate flooring is suitable for installation in the kitchen or in a child's room. It withstands moisture and mechanical damage thanks to a 0.2 mm thick protective layer of melamine films and a wax-treated interlocking system.", status=<AlternativeStatus.FINAL: 3>)
    
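As the note above mentions, run() also accepts other input types besides a list of dictionaries. A minimal sketch that passes a plain string, under that assumption:

    from yandex_cloud_ml_sdk import YCloudML

    sdk = YCloudML(
        folder_id="<folder_ID>",
        auth="<API_key>",
    )

    # Assumption (per the note above): a plain string is accepted as input
    # and is treated as a single user message.
    result = sdk.models.completions("yandexgpt").configure(temperature=0.5).run(
        "Find errors in the text and correct them: Erors wont corrct themselfs"
    )

    for alternative in result:
        print(alternative)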

Python

The example below is intended to run on macOS and Linux. To run it on Windows, see how to work with Bash in Microsoft Windows.

  1. Clone the Yandex Cloud API repository:

    git clone https://github.com/yandex-cloud/cloudapi
    
  2. Use the pip package manager to install the grpcio-tools package:

    pip install grpcio-tools
    
  3. Go to the folder hosting the cloned Yandex Cloud API repository:

    cd <path_to_cloudapi_folder>
    
  4. Create the output folder:

    mkdir output
    
  5. Generate the client interface code:

    python -m grpc_tools.protoc -I . -I third_party/googleapis \
      --python_out=output \
      --grpc_python_out=output \
        google/api/http.proto \
        google/api/annotations.proto \
        yandex/cloud/api/operation.proto \
        google/rpc/status.proto \
        yandex/cloud/operation/operation.proto \
        yandex/cloud/validation.proto \
        yandex/cloud/ai/foundation_models/v1/text_generation/text_generation_service.proto \
        yandex/cloud/ai/foundation_models/v1/text_common.proto
    
  6. In the output folder, create a file named test.py with the model request code:

    # coding=utf8
    import argparse
    import grpc
    
    import yandex.cloud.ai.foundation_models.v1.text_common_pb2 as pb
    import yandex.cloud.ai.foundation_models.v1.text_generation.text_generation_service_pb2_grpc as service_pb_grpc
    import yandex.cloud.ai.foundation_models.v1.text_generation.text_generation_service_pb2 as service_pb
    
    def run(iam_token, folder_id, user_text):
        cred = grpc.ssl_channel_credentials()
        channel = grpc.secure_channel('llm.api.cloud.yandex.net:443', cred)
        stub = service_pb_grpc.TextGenerationServiceStub(channel)
    
        request = service_pb.CompletionRequest(
            model_uri=f"gpt://{folder_id}/yandexgpt",
            completion_options=pb.CompletionOptions(
                max_tokens={"value": 2000}, 
                temperature={"value": 0.5}
            ),
        )
        message_system = request.messages.add()
        message_system.role = "system"
        message_system.text = "Correct errors in the text."
    
        message_user = request.messages.add()
        message_user.role = "user"
        message_user.text = user_text
    
        it = stub.Completion(request, metadata=(
            ('authorization', f'Bearer {iam_token}'),
        ))
        for response in it:
            for alternative in response.alternatives:
                print(alternative.message.text)
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("--iam_token", required=True, help="IAM token")
        parser.add_argument("--folder_id", required=True, help="Folder id")
        parser.add_argument("--user_text", required=True, help="User text")
        args = parser.parse_args()
        run(args.iam_token, args.folder_id, args.user_text)
    
  7. Run the test.py file, providing the IAM token and folder ID values:

    export IAM_TOKEN=<IAM_token>
    export FOLDER_ID=<folder_ID>
    export TEXT='Erors wont corrct themselfs'
    python output/test.py \
      --iam_token ${IAM_TOKEN} \
      --folder_id ${FOLDER_ID} \
      --user_text "${TEXT}"
    
    Result:
    Errors will not correct themselves.
    
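The example authenticates with an IAM token in the authorization metadata. If you use a service account API key instead, Yandex Cloud APIs accept the Api-Key scheme in the same header; a hedged sketch of the changed call:

    # Assumption: API-key authentication via the Api-Key scheme is accepted
    # by this endpoint; an api_key variable would replace the iam_token
    # argument used in test.py above.
    it = stub.Completion(request, metadata=(
        ('authorization', f'Api-Key {api_key}'),
    ))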

Streaming request via the gRPC API

SDK

If you use the run_stream method, the server returns intermediate generation results as well as the final one. Each intermediate response contains the entire generation result available so far; until the final response arrives, the result may keep changing as new messages are generated.

The effect of run_stream is most noticeable when generating and processing large texts.

  1. Create a file named generate-text.py and paste the following code into it:

    #!/usr/bin/env python3
    
    from __future__ import annotations
    from yandex_cloud_ml_sdk import YCloudML
    
    messages = [
        {"role": "system", "text": "Find errors in the text and correct them"},
        {"role": "user", "text": "Erors wyll not corrct themselfs."},
    ]
    
    
    def main():
        sdk = YCloudML(
            folder_id="<folder_ID>",
            auth="<API_key>",
        )
    
        model = sdk.models.completions("yandexgpt")
    
        for result in model.configure(temperature=0.5).run_stream(messages):
            for alternative in result:
                print(alternative)
    
    
    if __name__ == "__main__":
        main()
    

    Where:

    • messages: List of messages that set the context for the model:

      • role: Message sender's role:

        • user: Used for sending user messages to the model.
        • system: Used to set the query context and define the model's behavior.
        • assistant: Used for responses generated by the model. In chat mode, the model's responses tagged with the assistant role are included in the message to save the conversation context. Do not send user messages with this role.
      • text: Message text.

    • <folder_ID>: ID of the folder in which the service account was created.

    • <API_key>: Service account API key you got earlier; it is required for authentication in the API.

      The following examples use API key authentication. Yandex Cloud ML SDK also supports IAM token and OAuth token authentication. For more information, see Authentication in Yandex Cloud ML SDK.

    For more information about accessing a specific model version, see Accessing models.

  2. Run the created file:

    python3 generate-text.py
    

    Result:

    Alternative(role='assistant', text='O', status=<AlternativeStatus.PARTIAL: 1>)
    Alternative(role='assistant', text='Errors will not correct themselves.', status=<AlternativeStatus.FINAL: 3>)
    
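Because each partial result contains the whole text generated so far, a consumer that only needs the final text can simply keep the last value seen. A minimal sketch built on the example above:

    from yandex_cloud_ml_sdk import YCloudML

    sdk = YCloudML(
        folder_id="<folder_ID>",
        auth="<API_key>",
    )

    messages = [
        {"role": "system", "text": "Find errors in the text and correct them"},
        {"role": "user", "text": "Erors wyll not corrct themselfs."},
    ]

    model = sdk.models.completions("yandexgpt")

    final_text = None
    for result in model.configure(temperature=0.5).run_stream(messages):
        for alternative in result:
            # Each partial alternative carries the full text so far,
            # so the last one seen is the final generation.
            final_text = alternative.text

    print(final_text)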

Python

With the stream parameter enabled, the server returns intermediate generation results as well as the final one. Each intermediate response contains the entire generation result available so far; until the final response arrives, the result may keep changing as new messages are generated.

The effect of the stream parameter is most noticeable when generating and processing large texts.

Warning

The stream parameter is not available for the model's asynchronous mode.

Generate the gRPC client interface code as described in the guide above. At step 6, create a file named test.py with the following code to access the model.

# coding=utf8
import argparse
import grpc

import yandex.cloud.ai.foundation_models.v1.text_common_pb2 as pb
import yandex.cloud.ai.foundation_models.v1.text_generation.text_generation_service_pb2_grpc as service_pb_grpc
import yandex.cloud.ai.foundation_models.v1.text_generation.text_generation_service_pb2 as service_pb

def run(iam_token, folder_id, user_text):
    cred = grpc.ssl_channel_credentials()
    channel = grpc.secure_channel('llm.api.cloud.yandex.net:443', cred)
    stub = service_pb_grpc.TextGenerationServiceStub(channel)

    request = service_pb.CompletionRequest(
        model_uri=f"gpt://{folder_id}/yandexgpt",
        completion_options=pb.CompletionOptions(
            max_tokens={"value": 2000},
            temperature={"value": 0.5},
            stream=True
        ),
    )
    message_system = request.messages.add()
    message_system.role = "system"
    message_system.text = "Correct errors in the text."

    message_user = request.messages.add()
    message_user.role = "user"
    message_user.text = user_text

    it = stub.Completion(request, metadata=(
        ('authorization', f'Bearer {iam_token}'),
    ))

    for response in it:
        print(response)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--iam_token", required=True, help="IAM token")
    parser.add_argument("--folder_id", required=True, help="Folder id")
    parser.add_argument("--user_text", required=True, help="User text")
    args = parser.parse_args()
    run(args.iam_token, args.folder_id, args.user_text)
Result:
alternatives {
  message {
    role: "assistant"
    text: "E"
  }
  status: ALTERNATIVE_STATUS_PARTIAL
}
usage {
  input_text_tokens: 29
  completion_tokens: 1
  total_tokens: 30
}
model_version: "07.03.2024"

alternatives {
  message {
    role: "assistant"
    text: "Errors will not correct themselves."
  }
  status: ALTERNATIVE_STATUS_FINAL
}
usage {
  input_text_tokens: 29
  completion_tokens: 9
  total_tokens: 38
}
model_version: "07.03.2024"
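To print only the final text from this stream, you can filter on the alternative status shown in the output above. A sketch, assuming the AlternativeStatus enum is nested in the generated Alternative class, as is usual for protobuf:

    # Inside the response loop of test.py above:
    for response in it:
        for alternative in response.alternatives:
            # Assumption: nested-enum access on the generated class, matching
            # the ALTERNATIVE_STATUS_FINAL value in the output above.
            if alternative.status == pb.Alternative.ALTERNATIVE_STATUS_FINAL:
                print(alternative.message.text)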

See also

  • Text generation overview
  • Examples of working with ML SDK on GitHub
