Foundation models in DataSphere
Yandex DataSphere enables you to use foundation models to meet your objectives and to tune them on your own data if needed. Tuning is based on the fine-tuning method, and the results are stored in DataSphere.
The following foundation models are currently available in DataSphere:
- YandexGPT Pro basic model.
- Classifiers based on YandexGPT.
- YandexART generative model.
- Open-source models:
  - Mistral: Large language model.
  - Saiga/Mistral: Fine-tuned Russian version of Mistral.
Note
Foundation model tuning is at the Preview stage.
Models available for tuning
In the Foundation models section of DataSphere, the following models are available for tuning:
- YandexGPT Pro.
- Classifiers based on YandexGPT.
You can access the tuned models from the DataSphere project and through the Yandex Foundation Models API.
In the DataSphere interface, create a new fine-tuned foundation model, set the learning rate, and upload your data. Tuning will take some time.
Data required for YandexGPT Pro tuning
Note
To improve the quality of the responses you get, YandexGPT API logs user prompts. Do not use sensitive information and personal data in your prompts.
To fine-tune a YandexGPT model, prepare a UTF-8 encoded file in JSON Lines format with sample requests and reference responses.
File contents example:
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou are male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "What is your name?"}], "response": "My name is Gene. My full name is Eugene Neuron."}
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou are male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "Can you walk?"}], "response": "No. But I can move my arms, head, torso."}
{"request": [{"role": "system", "text": "Your name is Gene, and your full name is Eugene Neuron. \nYou are male. \nYou are a robot. \nYour responses are brief and concise. \nYou were designed in Boston. \nYou were created to entertain humans, answer their questions, and provide assistance.\nYou are a housing and utility expert. \nYou work for the Massachusetts State Administration office.\nYou can move your arms, head, torso, but cannot walk yet."}, {"role": "user", "text": "Where were you created?"}], "response": "In Boston."}
Where:
- `role`: Message sender's role:
  - `system`: Request context that defines the model's behavior.
  - `user`: Example of a user request to the model.
- `text`: Message text.
- `response`: Perfect response from the model.
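As an illustration, a file in this format can be generated with a short script. The following is a minimal sketch; the file name, prompt placeholder, and example data are illustrative, not part of the DataSphere interface:

```python
import json

# System prompt shared by all examples; it defines the model's behavior.
SYSTEM_PROMPT = "<the prompt that defines your model's behavior>"

# (user request, ideal response) pairs; replace with your own data.
examples = [
    ("What is your name?", "My name is Gene. My full name is Eugene Neuron."),
    ("Can you walk?", "No. But I can move my arms, head, torso."),
]

# Write one JSON object per line (JSON Lines), UTF-8 encoded.
with open("tuning_data.jsonl", "w", encoding="utf-8") as f:
    for question, answer in examples:
        record = {
            "request": [
                {"role": "system", "text": SYSTEM_PROMPT},
                {"role": "user", "text": question},
            ],
            "response": answer,
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```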
YandexGPT Pro tuning capabilities
You cannot tune the YandexGPT Pro model on new information, e.g., a support service knowledge base. However, you can train it to generate responses in a specific format or to analyze texts. For example, you can train the model to:
- Summarize and rewrite texts.
- Generate questions and answers from text input.
- Provide responses in a particular format or style.
- Classify texts, requests, and dialogs.
- Extract data from texts.
If you want to fine-tune the model for other capabilities, contact our support team.
Note
To make sure your fine-tuned model works properly, specify the prompt used for training in the message with the `system` sender role.
Data for classifier tuning based on YandexGPT
For classifier tuning based on YandexGPT, prepare a UTF-8 encoded file in JSON Lines format where each line contains the text to classify and its class labels.
The following limits apply when tuning a classifier model based on YandexGPT:
| Type of limit | Minimum value | Maximum value |
|---|---|---|
| Number of examples per dataset | 100 | 50,000 |
| Number of classes | 2 | 100 |
| Number of examples of the same class per dataset | 1 | — |
| Class name length, characters | — | 100 |
| Number of characters in the text of the request to classify | — | 10,000 |
We recommend tuning models on datasets containing at least 1,000 examples and at least 100 examples for each class.
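Before uploading, you may want to check a dataset against these limits. The sketch below is illustrative (the helper name and error messages are our own, not part of DataSphere); it assumes the JSON Lines layout shown in the examples that follow, where every key other than `text` is a class label with a 0/1 value:

```python
import json

# Limits from the table above.
MIN_EXAMPLES, MAX_EXAMPLES = 100, 50_000
MIN_CLASSES, MAX_CLASSES = 2, 100
MAX_CLASS_NAME_LEN = 100
MAX_TEXT_LEN = 10_000

def validate_dataset(path: str) -> None:
    """Raise ValueError if a JSON Lines classification dataset breaks the limits."""
    class_counts: dict[str, int] = {}
    n_examples = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            n_examples += 1
            if len(record["text"]) > MAX_TEXT_LEN:
                raise ValueError(f"request text exceeds {MAX_TEXT_LEN} characters")
            # Every key other than "text" is a class label with a 0/1 value.
            for name, value in record.items():
                if name == "text":
                    continue
                if len(name) > MAX_CLASS_NAME_LEN:
                    raise ValueError(f"class name {name!r} exceeds {MAX_CLASS_NAME_LEN} characters")
                class_counts[name] = class_counts.get(name, 0) + int(value)
    if not MIN_EXAMPLES <= n_examples <= MAX_EXAMPLES:
        raise ValueError(f"{n_examples} examples; allowed range is {MIN_EXAMPLES}-{MAX_EXAMPLES}")
    if not MIN_CLASSES <= len(class_counts) <= MAX_CLASSES:
        raise ValueError(f"{len(class_counts)} classes; allowed range is {MIN_CLASSES}-{MAX_CLASSES}")
    empty = [name for name, count in class_counts.items() if count < 1]
    if empty:
        raise ValueError(f"classes with no examples: {empty}")
```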
Example of file contents for binary classification training:
{"text":"I am fine","neutral":1,"emotional":0}
{"text":"I am doing great ","neutral":0,"emotional":1}
{"text":"You could not possibly understand how tough it is to get up for work at six in the morning every day and spend two hours commuting on public transport","neutral":0,"emotional":1}
{"text":"it is the same as always: work, home, and family.","neutral":1,"emotional":0}
Where:
- `text`: Message text.
- `neutral` and `emotional`: The two classes of binary classification.
Example of file contents for multi-class classification training:
{"text":"wow, how did that happen","anger":0,"fear":0,"joy":0,"sadness":0,"surprise":1}
{"text":"what should I do, what if they find out ?","anger":0,"fear":1,"joy":0,"sadness":0,"surprise":0}
{"text":"today is Friday, and tonight we are going to the club with friends","anger":0,"fear":0,"joy":1,"sadness":0,"surprise":0}
{"text":"do not lie to me, you just overslept again and were late for school because of that","anger":1,"fear":0,"joy":0,"sadness":0,"surprise":0}
Where:
- `text`: Message text.
- `anger`, `fear`, `joy`, `sadness`, and `surprise`: Classes.
Example of file contents for multi-label classification training:
{"computer_science":0,"physics":0,"mathematics":1,"statistics":1,"quantitative_biology":0,"quantitative_finance":0,"text":"Title: Bias Reduction in Instrumental Variable Estimation through First-Stage Shrinkage\nAbstract: The two-stage least-squares (2SLS) estimator is known to be biased when its\nfirst-stage fit is poor. I show that better first-stage prediction can\nalleviate this bias. In a two-stage linear regression model with Normal noise,\nI consider shrinkage in the estimation of the first-stage instrumental variable\ncoefficients. For at least four instrumental variables and a single endogenous\nregressor, I establish that the standard 2SLS estimator is dominated with\nrespect to bias. The dominating IV estimator applies James-Stein type shrinkage\nin a first-stage high-dimensional Normal-means problem followed by a\ncontrol-function approach in the second stage. It preserves invariances of the\nstructural instrumental variable equations.\n"}
{"computer_science":0,"physics":0,"mathematics":1,"statistics":0,"quantitative_biology":0,"quantitative_finance":0,"text":"Title: Essentially Finite Vector Bundles on Normal Pseudo-proper Algebraic Stacks\nAbstract: Let $X$ be a normal, connected and projective variety over an algebraically\nclosed field $k$. It is known that a vector bundle $V$ on $X$ is essentially\nfinite if and only if it is trivialized by a proper surjective morphism $f:Y\\to\nX$. In this paper we introduce a different approach to this problem which\nallows to extend the results to normal, connected and strongly pseudo-proper\nalgebraic stack of finite type over an arbitrary field $k$.\n"}
{"computer_science":1,"physics":0,"mathematics":0,"statistics":1,"quantitative_biology":0,"quantitative_finance":0,"text":"Title: MOLIERE: Automatic Biomedical Hypothesis Generation System\nAbstract: Hypothesis generation is becoming a crucial time-saving technique which\nallows biomedical researchers to quickly discover implicit connections between\nimportant concepts. Typically, these systems operate on domain-specific\nfractions of public medical data. MOLIERE, in contrast, utilizes information\nfrom over 24.5 million documents. At the heart of our approach lies a\nmulti-modal and multi-relational network of biomedical objects extracted from\nseveral heterogeneous datasets from the National Center for Biotechnology\nInformation (NCBI). These objects include but are not limited to scientific\npapers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses\nusing Latent Dirichlet Allocation applied on abstracts found near shortest\npaths discovered within this network, and demonstrate the effectiveness of\nMOLIERE by performing hypothesis generation on historical data. Our network,\nimplementation, and resulting data are all publicly available for the broad\nscientific community.\n"}
Where:
- `computer_science`, `physics`, `mathematics`, `statistics`, `quantitative_biology`, and `quantitative_finance`: Classes.
- `text`: Message text, consisting of:
  - `Title`: Message title.
  - `Abstract`: Main text of the message.
After completing the training, you will get the ID of the model tuned for classification. Provide this ID in the `modelUri` field of the request body when calling the classify method of the Text Classification API.
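As an illustration, such a classification request might look like the sketch below. The endpoint URL, the `cls://` model URI scheme, and the response layout are assumptions based on the Foundation Models API reference, not guarantees; verify them there before use:

```python
import requests

# Assumed endpoint of the Text Classification API; verify it in the
# Foundation Models API reference.
URL = "https://llm.api.cloud.yandex.net/foundationModels/v1/textClassification"

IAM_TOKEN = "<IAM_token>"                          # credentials of the account
MODEL_URI = "cls://<folder_ID>/<tuned_model_ID>"   # ID you got after tuning

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    json={"modelUri": MODEL_URI, "text": "I am doing great"},
)
response.raise_for_status()
print(response.json())  # per-class confidence scores
```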
Requests to fine-tuned models
You can send requests to a fine-tuned model through the DataSphere Playground interface or via the Foundation Models API from DataSphere and other applications. Requests in Playground are made on behalf of the user. Playground is not yet available for open-source models.
To make API requests, add the user or service account you are going to use for requests to the list of DataSphere project members. The account must have the `ai.languageModels.user` role.
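For instance, a text generation call to a fine-tuned YandexGPT model could look like the following sketch. The endpoint URL, the `gpt://` model URI scheme, and the response parsing are assumptions to verify against the Foundation Models API reference; note how the `system` message repeats the exact prompt used during tuning, as recommended above:

```python
import requests

# Assumed text generation endpoint; verify it in the Foundation Models
# API reference.
URL = "https://llm.api.cloud.yandex.net/foundationModels/v1/completion"

IAM_TOKEN = "<IAM_token>"                         # account with ai.languageModels.user
MODEL_URI = "gpt://<folder_ID>/<tuned_model_ID>"  # URI of your tuned model

body = {
    "modelUri": MODEL_URI,
    "completionOptions": {"stream": False, "temperature": 0.3, "maxTokens": 200},
    "messages": [
        # Repeat the exact system prompt that was used during tuning.
        {"role": "system", "text": "<prompt used for training>"},
        {"role": "user", "text": "Can you walk?"},
    ],
}

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    json=body,
)
response.raise_for_status()
# The response layout below is an assumption; inspect response.json() as needed.
print(response.json()["result"]["alternatives"][0]["message"]["text"])
```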