
Datasets

Written by
Yandex Cloud
Updated at November 6, 2025
  • Text generation datasets
    • Text generation requests
    • Text generation requests and responses
  • Text classification datasets
    • Multi-class classification
    • Multi-label classification
    • Binary classification
  • Embedding datasets
  • Vision language datasets
    • Request datasets
    • Response datasets

In Yandex AI Studio, datasets store the data you need to tune models and to run them in batch mode. You can create datasets in the management console, via the API, or with the Yandex Cloud ML SDK.

All datasets are created from UTF-8-encoded JSON Lines files. The structure of the file contents depends on the dataset type. You can create datasets of the following types:
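Every dataset type shares this physical format: one JSON object per line in a UTF-8-encoded file. A minimal Python sketch that writes and re-reads such a file (the file name and example texts are placeholders, not part of the service API):

```python
import json

# Hypothetical examples; the structure matches a text generation request dataset.
examples = [
    {"request": [{"role": "user", "text": "What is a dataset?"}]},
    {"request": [{"role": "system", "text": "Answer briefly."},
                 {"role": "user", "text": "What is JSON Lines?"}]},
]

# JSON Lines: one compact JSON object per line, UTF-8 encoded.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Read the file back to confirm every line parses independently.
with open("dataset.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
print(len(parsed))  # 2
```

Each line must be a complete, standalone JSON object; a pretty-printed multi-line object would break the format.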

  • Text generation: TextToTextGenerationRequest and TextToTextGeneration.
  • Multi-label classification and binary classification: TextClassificationMultilabel.
  • Multi-class classification: TextClassificationMulticlass.
  • Embedding tuning pairs: TextEmbeddingPairParams.
  • Embedding tuning triplets: TextEmbeddingTripletParams.
  • Vision language datasets: ImageTextToTextGenerationRequest and ImageTextToTextGeneration.

To get the current list of available dataset types, run this request:

grpcurl \
  -H "Authorization: Bearer <IAM_token>" \
  llm.api.cloud.yandex.net:443 yandex.cloud.ai.dataset.v1.DatasetService.ListTypes 

Text generation datasets

AI Studio allows you to create two types of text generation datasets.

Text generation requests

Datasets that contain only request texts can be used to run text generation models in batch mode. Each line contains a separate JSON-formatted request for text generation. A request can contain either a single message with the user role or a dialog alternating between the user and assistant roles; optionally, it can also include an instruction for the model:

{"request": [{"role": "user", "text": "<Question>"}]}
{"request": [{"role": "system", "text": "<instruction>"}, {"role": "user", "text": "<Question>"}]}
{"request": [{"role": "system", "text": "<instruction>"}, {"role": "user", "text": "<Replica_1>"}, {"role": "assistant", "text": "<Replica_2>"}, {"role": "user", "text": "<Replica_3>"}, {"role": "assistant", "text": "<Replica_4>"}]}

Text generation requests and responses

Datasets with questions and answers are used to fine-tune models. Responses are also returned in this format when you access models in batch mode. Each line contains a separate example in JSON format:

{"request": [{"role": "user", "text": "<Question>"}], "response": "<Response>"}
{"request": [{"role": "system", "text": "<instruction>"}, {"role": "user", "text": "<Question>"}], "response": "<Response>"}
{"request": [{"role": "system", "text": "<instruction>"}, {"role": "user", "text": "<Replica_1>"}, {"role": "assistant", "text": "<Replica_2>"}, {"role": "user", "text": "<Replica_3>"}, {"role": "assistant", "text": "<Replica_4>"}], "response": "<Response>"}

Tip

When fine-tuning models, use the same instruction for each tuning example and use it when accessing the fine-tuned model. This improves fine-tuning efficiency.

The tuning dataset must include at least ten prompts with reference responses. The maximum prompt length is 8,000 tokens; the maximum reference response length is 2,000 tokens. The combined length of a prompt and its response must not exceed 8,000 tokens.

Example of the contents of a text generation model tuning dataset:

{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou respond as a male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "What is your name?"}], "response": "I'm Gene. My full name is Eugene Neuron."}
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou respond as a male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "Can you walk?"}], "response": "No. But I can move my arms, head, and torso."}
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou respond as a male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "Where were you created?"}], "response": "In Boston."}

Where:

  • role: Message sender's role:

    • system: Prompt context that defines the model's behavior.
    • user: Example of a user prompt to the model.
  • text: Message text.

  • response: The model's reference (ideal) response.
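The tip above (one shared instruction across all tuning examples) is easy to enforce before uploading a dataset. A sketch that checks each example carries a response and that all examples share the same system instruction; the example texts are invented, and the check itself is not part of the service:

```python
import json

def check_shared_instruction(lines):
    """Verify every tuning example has a response and uses one shared system instruction."""
    instructions = set()
    for line in lines:
        example = json.loads(line)
        assert "response" in example, "each tuning example needs a reference response"
        system = [m["text"] for m in example["request"] if m["role"] == "system"]
        # Record the instruction (or None if the example has no system message).
        instructions.add(system[0] if system else None)
    return len(instructions) == 1

lines = [
    '{"request": [{"role": "system", "text": "Answer as a robot."}, '
    '{"role": "user", "text": "Can you walk?"}], "response": "No."}',
    '{"request": [{"role": "system", "text": "Answer as a robot."}, '
    '{"role": "user", "text": "What is your name?"}], "response": "Gene."}',
]
print(check_shared_instruction(lines))  # True
```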

Text classification datasets

When creating datasets for text classification, try to give meaningful names to classes.

The following limits apply when tuning a classifier model based on YandexGPT:

| Type of limit | Minimum value | Maximum value |
|---|---|---|
| Number of examples per dataset | 100 | 50,000 |
| Number of classes | 1 | 100 |
| Number of examples of the same class per dataset | 1 | — |
| Class name length, in characters | — | 100 |
| Number of tokens in the text of the request to classify | — | 8,000 |

We recommend tuning models on datasets containing at least 1,000 examples and at least 100 examples for each class.

Multi-class classification

Multi-class classification datasets should contain examples of texts and their classification data. Each line contains a separate example in JSON format. Each example can be assigned to only one class.

{"text":"<text_1>","<class_1>":0,"<class_2>":0,"<class_3>":1}
{"text":"<text_2>","<class_1>":1,"<class_2>":0,"<class_3>":0}
{"text":"<text_3>","<class_1>":0,"<class_2>":1,"<class_3>":0}
{"text":"<text_4>","<class_1>":0,"<class_2>":0,"<class_3>":1}

Example of the contents of a multi-class classification tuning dataset:

{"text":"wow and how did that happen","anger":0,"fear":0,"joy":0,"sadness":0,"surprise":1}
{"text":"what am I to do if this gets out","anger":0,"fear":1,"joy":0,"sadness":0,"surprise":0}
{"text":"it's Friday and in the evening we're going to a club with my friends.","anger":0,"fear":0,"joy":1,"sadness":0,"surprise":0}
{"text":"don't lie to me you just overslept again and that's why you were late for school","anger":1,"fear":0,"joy":0,"sadness":0,"surprise":0}

Where:

  • text: Message text.
  • anger, fear, joy, sadness, and surprise: Classes.
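Before uploading a multi-class dataset, you can check that each example is assigned to exactly one class. A sketch using the class names from the example above; the validation helper is illustrative and not part of the service:

```python
import json

CLASSES = ["anger", "fear", "joy", "sadness", "surprise"]

def validate_multiclass(line):
    """A multi-class example must assign its text to exactly one class."""
    example = json.loads(line)
    assert "text" in example, "each example needs a text field"
    labels = [example[c] for c in CLASSES]
    assert all(v in (0, 1) for v in labels), "labels must be 0 or 1"
    return sum(labels) == 1  # exactly one class per example

line = ('{"text":"wow and how did that happen",'
        '"anger":0,"fear":0,"joy":0,"sadness":0,"surprise":1}')
print(validate_multiclass(line))  # True
```

For multi-label datasets the same check would instead allow `sum(labels) >= 1`, since a text may belong to several classes at once.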

Multi-label classification

Multi-label classification datasets should contain examples of texts and their classification data. Each text can belong to more than one class at the same time. Each dataset line contains a separate example in JSON format.

{"text":"<text_1>","<class_1>":0,"<class_2>":0,"<class_3>":1}
{"text":"<text_2>","<class_1>":1,"<class_2>":0,"<class_3>":1}
{"text":"<text_3>","<class_1>":1,"<class_2>":1,"<class_3>":0}

Example of file contents for multi-label classification training:

{"text":"Abstract: The two-stage least-squares (2SLS) estimator is known to be biased when its first-stage fit is poor. I show that better first-stage prediction can alleviate this bias. In a two-stage linear regression model with Normal noise, I consider shrinkage in the estimation of the first-stage instrumental variable coefficients. For at least four instrumental variables and a single endogenous regressor, I establish that the standard 2SLS estimator is dominated with respect to bias.", "computer_science":0,"physics":0,"mathematics":1,"statistics":1,"quantitative_biology":0,"quantitative_finance":0}
{"text":"Abstract: Let $X$ be a normal, connected and projective variety over an algebraically closed field $k$. It is known that a vector bundle $V$ on $X$ is essentially finite if and only if it is trivialized by a proper surjective morphism $f:Y to X$. In this paper we introduce a different approach to this problem which allows to extend the results to normal, connected and strongly pseudo-proper algebraic stack of finite type over an arbitrary field $k$.", "computer_science":0,"physics":0,"mathematics":1,"statistics":0,"quantitative_biology":0,"quantitative_finance":0}
{"text":"Abstract: Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI).", "computer_science":1,"physics":0,"mathematics":0,"statistics":1,"quantitative_biology":0,"quantitative_finance":0}

Where:

  • computer_science, physics, mathematics, statistics, quantitative_biology, and quantitative_finance: Classes.
  • text: Message text.

Binary classification

Binary classification datasets should contain examples of texts and their classification attributes. Each dataset line contains a separate example in JSON format.

{"text":"<text_1>","<class>":1}
{"text":"<text_2>","<class>":0}
{"text":"<text_3>","<class>":1}

Example of file contents for binary classification training:

{"text":"I'm fine","neutral":1}
{"text":"I did great","neutral":0}
{"text":"you couldn't possibly understand how tough it is to get up for work at six in the morning every day and spend two hours commuting on public transport","neutral":0}
{"text":"everything is as usual work home family","neutral":1}

Where:

  • text: Message text.
  • neutral: Binary classification class.

Although binary classification is a special case of multi-class classification, in practice it makes more sense to work with the classification probability. To create a binary classification dataset, specify the TextClassificationMultilabel type.

Embedding datasets

Embedding tuning datasets may contain pairs of semantically related texts or triplets consisting of a text, a semantically related text, and an unrelated text. Each line contains a separate example in JSON format:

  • Pair dataset structure:

    {"anchor":"<main_text_1>","positive":"<related_text_1>"}
    {"anchor":"<main_text_2>","positive":"<related_text_2>"}
    
  • Triplet dataset structure:

    {"anchor":"<main_text_1>","positive":"<related_text_1>","negative":"<negative_example_1>"}
    {"anchor":"<main_text_2>","positive":"<related_text_2>","negative":"<negative_example_2>"}
    

    Example of the contents of an embedding tuning triplet dataset:

    {"anchor": "Sigale is a commune.", "positive": "Sigale is a commune in the Alpes-Maritimes department in southeastern France.", "negative": "When it was properly done, as in Austen, Whately said, imaginative literature concerned itself with generalised human experience from which the reader could gain important insights into human nature; in other words, it was moral."}
    {"anchor": "Helena is the capital city of the U.S. state of Montana .", "positive": "Helena is the capital city of the U.S. state of Montana and the county seat of Lewis and Clark County.", "negative": "After reading Pride and Prejudice, playwright Richard Sheridan advised a friend to buy it immediately cause 'it was one of the cleverest things' he had ever read."}
    

    Where:

    • anchor: Main text.
    • positive: Semantically related text.
    • negative: Unrelated text.

When fine-tuning embedding models for search scenarios, we recommend explicitly specifying whether the text is document text or query text:

{"anchor": "[QUERY] city austin","positive": "[DOCUMENT] Austin is located in the central part of Texas, northeast of San Antonio. The city is traversed by one interstate highway, I-35. Austin ranges in elevation from 88 to 405 meters above sea level.","negative": "[DOCUMENT] Biographical information about Jane Austen's life is scant, with a few surviving letters and biographical notes written by her family members. During her lifetime, Austen may have written up to 3,000 letters, but only 161 remain. Many of these letters were addressed to her elder sister, Cassandra."}
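Applying these prefixes can be scripted when building triplet lines. A small sketch; the `[QUERY]` and `[DOCUMENT]` markers are taken from the example above, while the helper function name is hypothetical:

```python
import json

def search_triplet(query, document, negative_document):
    """Build one triplet line, prefixing texts for search-oriented tuning."""
    return {
        "anchor": f"[QUERY] {query}",
        "positive": f"[DOCUMENT] {document}",
        "negative": f"[DOCUMENT] {negative_document}",
    }

line = json.dumps(search_triplet(
    "city austin",
    "Austin is located in the central part of Texas.",
    "Biographical information about Jane Austen's life is scant.",
), ensure_ascii=False)
print(line)
```

Writing one such line per training example yields a triplet dataset ready for upload.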

Vision language datasets

You need vision language datasets when working with multimodal models in batch mode. AI Studio supports two types of datasets.

Request datasets

Request datasets for vision language models contain request texts and Base64-encoded images. Each line contains a separate example in JSON format.

{"request": [{"role": "user", "content": [{"type": "text", "text": "<question>"}, {"type": "image", "image": "<Base64-encoded_image>"}]}]}

Response datasets

Response datasets for vision language models contain request texts, Base64-encoded images, and a generated response for each request. Each line contains a separate example in JSON format.

{"request": [{"role": "user", "content": [{"type": "text", "text": "<question>"}, {"type": "image", "image": "<Base64-encoded_image>"}]}], "response": "this is a picture of a robot"}
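Base64 encoding of images can be done with the standard library. A sketch that builds one request dataset line from an image file; the function name and file handling are illustrative, not part of the service API:

```python
import base64
import json

def vision_request_line(question, image_path):
    """Build one JSON Lines entry for a vision language request dataset."""
    with open(image_path, "rb") as f:
        # Base64-encode the raw image bytes as required by the dataset format.
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    request = [{"role": "user", "content": [
        {"type": "text", "text": question},
        {"type": "image", "image": image_b64},
    ]}]
    return json.dumps({"request": request}, ensure_ascii=False)
```

Calling this helper once per image and writing each returned string as its own line produces a request dataset file.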

Use cases

  • Fine-tuning a text generation model
  • Fine-tuning text classification models
  • Model fine-tuning in DataSphere Notebooks
  • Creating a dataset for tuning a text generation model
  • Creating a dataset for tuning a text classification model

© 2025 Direct Cursus Technology L.L.C.