Datasets

Updated on April 21, 2025

Text generation datasets
Text classification datasets
- Binary and multi-class classification
- Multi-label classification

In Yandex Foundation Models, datasets store the data used to tune models. You can create datasets in the management console, via the API, or with Yandex Cloud ML SDK.

All datasets are created from UTF-8-encoded JSON Lines files. The format of the dataset contents depends on the dataset type. You can create datasets of the following types:

Text generation: TextToTextGeneration.
Multi-label classification: TextClassificationMultilabel.
Binary and multi-class classification: TextClassificationMulticlass.
Embedding tuning pairs: TextEmbeddingPairParams.
Embedding tuning triplets: TextEmbeddingTripletParams.
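
Because every dataset starts as a UTF-8-encoded JSON Lines file, it can help to check a file locally before uploading it. The following is a minimal sketch using only the Python standard library; the function name parse_json_lines is hypothetical and not part of any Yandex API, and skipping blank lines is an assumption about what the service tolerates.

```python
import json


def parse_json_lines(data: bytes) -> list[dict]:
    """Decode UTF-8 bytes and parse one JSON object per non-empty line."""
    # Strict decoding raises UnicodeDecodeError if the file is not valid UTF-8.
    text = data.decode("utf-8")
    records = []
    # Split on "\n" only: str.splitlines() would also split on characters
    # such as U+2028, which may legally appear inside JSON strings.
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        obj = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        if not isinstance(obj, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        records.append(obj)
    return records


sample = '{"text":"café","neutral":1}\n{"text":"ok","neutral":0}\n'.encode("utf-8")
records = parse_json_lines(sample)
```

To check a real file, read it in binary mode (for example, with open(path, "rb")) and pass the bytes to the function.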

Text generation datasets

Each example in a text generation dataset contains an instruction for the model (the guide), a sample user question, and the reference answer to it. Each line holds one example in JSON format:

{"request": [{"role": "system", "text": "<guide>"}, {"role": "user", "text": "<Question>"}], "response": "<Answer>"}
{"request": [{"role": "system", "text": "<guide>"}, {"role": "user", "text": "<Question>"}], "response": "<Answer>"}
{"request": [{"role": "system", "text": "<guide>"}, {"role": "user", "text": "<Question>"}], "response": "<Answer>"}
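
Lines in this format could be generated with Python's standard json module, as in the sketch below. The helper names make_example and to_json_lines are hypothetical and only illustrate the structure shown above; they are not part of Yandex Cloud ML SDK.

```python
import json


def make_example(guide: str, question: str, answer: str) -> dict:
    """Build one text generation example in the format shown above."""
    return {
        "request": [
            {"role": "system", "text": guide},
            {"role": "user", "text": question},
        ],
        "response": answer,
    }


def to_json_lines(examples: list[dict]) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples) + "\n"


examples = [
    make_example("You are a robot.", "What is your name?", "I'm Gene."),
    make_example("You are a robot.", "Can you walk?", "No."),
]
payload = to_json_lines(examples)
```

The resulting string can then be written to a file opened with encoding="utf-8". Newlines inside a message are escaped by json.dumps, so each example stays on a single line.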

Text generation datasets are used to fine-tune the YandexGPT Lite and Llama 8B¹ models. To fine-tune a text generation model, prepare a file with at least ten examples of prompts and reference responses. A prompt can be up to 8,000 tokens long and a reference response up to 2,000 tokens; the combined length of a prompt and its reference response must not exceed 8,000 tokens.
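
A rough local pre-check for these requirements might look like the sketch below. It assumes the examples are already parsed into Python dictionaries and verifies only the example count and the record structure. The function name is hypothetical, and token limits are not checked, because counting tokens requires the model's own tokenizer.

```python
MIN_EXAMPLES = 10  # minimum number of examples stated above


def _is_message(message: object) -> bool:
    """A message is a dict with string 'role' and 'text' fields."""
    return (
        isinstance(message, dict)
        and isinstance(message.get("role"), str)
        and isinstance(message.get("text"), str)
    )


def validate_generation_records(records: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(records) < MIN_EXAMPLES:
        problems.append(f"{len(records)} examples, need at least {MIN_EXAMPLES}")
    for number, record in enumerate(records, start=1):
        request = record.get("request")
        if not isinstance(request, list) or not request:
            problems.append(f"example {number}: 'request' must be a non-empty list")
        elif not all(_is_message(m) for m in request):
            problems.append(f"example {number}: each message needs string 'role' and 'text'")
        if not isinstance(record.get("response"), str):
            problems.append(f"example {number}: 'response' must be a string")
    return problems
```

Passing this check does not guarantee that the service accepts the dataset; it only catches the most common structural mistakes early.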

Example of the contents of a text generation model tuning dataset:

{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou respond as a male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "What is your name?"}], "response": "I'm Gene. My full name is Eugene Neuron."}
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou respond as a male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "Can you walk?"}], "response": "No. But I can move my arms, head, torso."}
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou respond as a male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "Where were you created?"}], "response": "In Boston."}

Where:

role: Message sender's role:
- system: Prompt context that defines the model's behavior.
- user: Example of a user prompt to the model.
text: Message text.
response: Reference response the model is expected to give.

Text classification datasets

When creating datasets for text classification, give the classes meaningful names.

The following limits apply when tuning a classifier model based on YandexGPT:

Limit type	Minimum value	Maximum value
Number of examples per dataset	100	50,000
Number of classes	1	100
Number of same class examples per dataset	1	—
Class name length, in characters	—	100
Number of tokens in the text of the request to classify	—	8,000

We recommend tuning models on datasets containing at least 1,000 examples and at least 100 examples for each class.
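
The limits and recommendations above can be checked locally before uploading a dataset. The sketch below assumes the records are already parsed and that every key other than "text" is a class name, as in the examples in this article. The function name is hypothetical, and the 8,000-token limit on request text is not checked, since that requires the model's tokenizer.

```python
from collections import Counter

# Limits and recommendations copied from the table and text above.
MIN_EXAMPLES, MAX_EXAMPLES = 100, 50_000
MAX_CLASSES = 100
MAX_CLASS_NAME_LENGTH = 100
RECOMMENDED_EXAMPLES = 1_000
RECOMMENDED_PER_CLASS = 100


def check_classification_limits(records: list[dict]) -> dict:
    """Return hard-limit errors and recommendation warnings for parsed records."""
    errors, warnings = [], []

    # Count positive examples per class; every key except "text" is a class.
    per_class = Counter()
    for record in records:
        for name, value in record.items():
            if name != "text":
                per_class[name] += int(value == 1)

    if not MIN_EXAMPLES <= len(records) <= MAX_EXAMPLES:
        errors.append(f"{len(records)} examples, allowed {MIN_EXAMPLES} to {MAX_EXAMPLES}")
    if not 1 <= len(per_class) <= MAX_CLASSES:
        errors.append(f"{len(per_class)} classes, allowed 1 to {MAX_CLASSES}")
    for name, count in per_class.items():
        if len(name) > MAX_CLASS_NAME_LENGTH:
            errors.append(f"class name longer than {MAX_CLASS_NAME_LENGTH} characters: {name[:20]}...")
        if count < 1:
            errors.append(f"class {name!r} has no examples")
        elif count < RECOMMENDED_PER_CLASS:
            warnings.append(f"class {name!r} has {count} examples, {RECOMMENDED_PER_CLASS} recommended")
    if len(records) < RECOMMENDED_EXAMPLES:
        warnings.append(f"{len(records)} examples, {RECOMMENDED_EXAMPLES} recommended")
    return {"errors": errors, "warnings": warnings}
```

Errors correspond to the hard limits in the table, while warnings only reflect the recommended dataset sizes.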

Binary and multi-class classification

Binary and multi-class classification datasets contain sample texts and the class each text belongs to. Each line holds one example in JSON format, with 1 marking the assigned class and 0 every other class. Each example can be assigned to only one class.

{"text":"<text_1>","<class_1>":0,"<class_2>":0,"<class_3>":1}
{"text":"<text_2>","<class_1>":1,"<class_2>":0,"<class_3>":0}
{"text":"<text_3>","<class_1>":0,"<class_2>":1,"<class_3>":0}
{"text":"<text_4>","<class_1>":0,"<class_2>":0,"<class_3>":1}

Example of file contents for binary classification training:

{"text":"I'm fine","neutral":1}
{"text":"I did great","neutral":0}
{"text":"you couldn't possibly understand how tough it is to get up for work at six in the morning every day and spend two hours commuting on public transport","neutral":0}
{"text":"everything is as usual work home family","neutral":1}

Where:

text: Message text.
neutral: Binary classification class.

Example of the contents of a multi-class classification tuning dataset:

{"text":"wow and how did that happen","anger":0,"fear":0,"joy":0,"sadness":0,"surprise":1}
{"text":"what am I to do if this gets out","anger":0,"fear":1,"joy":0,"sadness":0,"surprise":0}
{"text":"it's Friday and in the evening we're going to a club with my friends.","anger":0,"fear":0,"joy":1,"sadness":0,"surprise":0}
{"text":"don't lie to me you just overslept again and that's why you were late for school","anger":1,"fear":0,"joy":0,"sadness":0,"surprise":0}

Where:

text: Message text.
anger, fear, joy, sadness, and surprise: Classes.
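
If the source data stores a single label per text, a multi-class line like the ones above could be produced as in the following sketch. The function name encode_multiclass and the CLASSES list are illustrative assumptions; only the output format comes from this article.

```python
import json

CLASSES = ["anger", "fear", "joy", "sadness", "surprise"]


def encode_multiclass(text: str, label: str, classes: list[str]) -> str:
    """Return one JSON line with 1 for the given label and 0 for all other classes."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    record = {"text": text}
    # Exactly one class gets 1, which satisfies the one-class-per-example rule.
    record.update({name: int(name == label) for name in classes})
    # Compact separators reproduce the spacing used in the examples above.
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


line = encode_multiclass("wow and how did that happen", "surprise", CLASSES)
# line == '{"text":"wow and how did that happen","anger":0,"fear":0,"joy":0,"sadness":0,"surprise":1}'
```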

Multi-label classification

Multi-label classification datasets contain sample texts and the classes each text belongs to. A text can belong to more than one class at the same time. Each dataset line holds one example in JSON format.

{"text":"<text_1>","<class_1>":0,"<class_2>":0,"<class_3>":1}
{"text":"<text_2>","<class_1>":1,"<class_2>":0,"<class_3>":1}
{"text":"<text_3>","<class_1>":1,"<class_2>":1,"<class_3>":0}

Example of file contents for multi-label classification training:

{"text":"Abstract: The two-stage least-squares (2SLS) estimator is known to be biased when its first-stage fit is poor. I show that better first-stage prediction can alleviate this bias. In a two-stage linear regression model with Normal noise, I consider shrinkage in the estimation of the first-stage instrumental variable coefficients. For at least four instrumental variables and a single endogenous regressor, I establish that the standard 2SLS estimator is dominated with respect to bias.", "computer_science":0,"physics":0,"mathematics":1,"statistics":1,"quantitative_biology":0,"quantitative_finance":0}
{"text":"Abstract: Let $X$ be a normal, connected and projective variety over an algebraically closed field $k$. It is known that a vector bundle $V$ on $X$ is essentially finite if and only if it is trivialized by a proper surjective morphism $f:Y to X$. In this paper we introduce a different approach to this problem which allows to extend the results to normal, connected and strongly pseudo-proper algebraic stack of finite type over an arbitrary field $k$.", "computer_science":0,"physics":0,"mathematics":1,"statistics":0,"quantitative_biology":0,"quantitative_finance":0}
{"text":"Abstract: Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI).", "computer_science":1,"physics":0,"mathematics":0,"statistics":1,"quantitative_biology":0,"quantitative_finance":0}

Where:

computer_science, physics, mathematics, statistics, quantitative_biology, and quantitative_finance: Classes.
text: Message text.
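
For multi-label data, each text carries a set of labels rather than a single one. A minimal sketch of the corresponding encoding, with the hypothetical helper encode_multilabel and an illustrative class list:

```python
import json

CLASSES = [
    "computer_science", "physics", "mathematics",
    "statistics", "quantitative_biology", "quantitative_finance",
]


def encode_multilabel(text: str, labels: set[str], classes: list[str]) -> str:
    """Return one JSON line with 1 for every class in labels and 0 otherwise."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    record = {"text": text}
    for name in classes:
        # Unlike multi-class data, any number of classes may be set to 1.
        record[name] = int(name in labels)
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


line = encode_multilabel("Abstract: ...", {"mathematics", "statistics"}, CLASSES)
```

Iterating over the class list rather than the label set keeps the column order identical on every line.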

¹ Llama was created by Meta. Meta is designated as an extremist organization and its activities are prohibited in Russia.
