Dedicated instances
This feature is in the Preview stage.
With AI Studio, you can deploy certain models on a dedicated instance. Unlike with manual deployment on Yandex Compute Cloud VMs, you do not need to configure the environment or pick optimal VM parameters: AI Studio provides stable, reliable, and efficient model inference and monitors its operation automatically.
Dedicated instances have a number of advantages:
- Guaranteed performance parameters that are not affected by other users' traffic.
- No additional quotas on requests or parallel generations: the only limits are those of the instance configuration you select.
- Optimized model inference for efficient hardware utilization.
Dedicated instances are a good fit if you need to process large volumes of requests without delays. A dedicated instance is not billed by the number of input and output tokens: you pay only for its running time.
Dedicated instance models
All deployed models are accessible via an OpenAI-compatible API, ML SDK, and AI Playground. To deploy a dedicated instance, you need the ai.models.editor role or higher for the folder; to query the model, the ai.languageModels.user role is enough. For a query example, see the sketch after the table below.
| Model | Context (tokens) | License |
|---|---|---|
| Qwen 2.5 VL 32B Instruct | 4,096 | Apache 2.0 |
| Qwen 2.5 72B Instruct | 16,384 | |
| Gemma 3 4B it | 4,096 | |
| Gemma 3 12B it | 4,096 | |
| gpt-oss-20b | 128,000 | Apache 2.0 |
| gpt-oss-120b | 128,000 | Apache 2.0 |
| T-pro-it-2.0-FP8 | 40,000 | Apache 2.0 |
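Querying a deployed model through the OpenAI-compatible API works the same way as querying any OpenAI-style endpoint. Below is a minimal sketch using the openai Python SDK; the endpoint URL, model URI, and API key are placeholders, not values confirmed by this article, so substitute the ones issued for your folder and deployed instance.

```python
# Minimal sketch: querying a dedicated instance via the OpenAI-compatible API.
# The base_url, api_key, and model URI below are placeholders (assumptions);
# replace them with the values for your folder and deployed instance.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.api.cloud.yandex.net/v1",  # assumed OpenAI-compatible endpoint
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="gpt://<folder_id>/<deployed_model_id>",  # hypothetical model URI
    messages=[{"role": "user", "content": "Summarize dedicated instances in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```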
Dedicated instance configurations
Each model may be available for deployment in several configurations: S, M, or L. Each configuration guarantees specific values of TTFT (time to first token), latency (the time it takes to generate a full response), and TPS (tokens per second) for requests of different context lengths.
The figure below shows how latency and the number of tokens the model processes per second depend on the number of parallel generations (Concurrency in the figure): up to a certain point, the more generations the model processes in parallel, the longer each generation takes and the more tokens are generated per second overall.
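You can check these metrics against your own instance. The sketch below, using the same hypothetical endpoint and model URI as in the example above, streams a single response and measures TTFT directly; counting one streamed chunk as one token gives only a rough approximation of TPS.

```python
# Minimal sketch: measuring TTFT and approximate TPS for one streamed request.
# Endpoint, key, and model URI are the same placeholders as in the example above.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.api.cloud.yandex.net/v1",  # assumed endpoint
    api_key="<your_api_key>",
)

start = time.monotonic()
first_token_at = None
chunks = 0  # one streamed chunk is roughly one token

stream = client.chat.completions.create(
    model="gpt://<folder_id>/<deployed_model_id>",  # hypothetical model URI
    messages=[{"role": "user", "content": "Write a short product description."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first token arrived: this gives TTFT
        chunks += 1

if first_token_at is not None:
    ttft = first_token_at - start
    generation_time = time.monotonic() - first_token_at
    print(f"TTFT: {ttft:.2f} s")
    print(f"TPS (approx.): {chunks / generation_time:.1f} tokens/s")
```

Running several copies of this script in parallel is a simple way to reproduce the concurrency effect described above: per-request latency grows while aggregate tokens per second increases, up to the limit of the selected configuration.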