

Datasets domain

Created by Yandex Cloud
Updated November 7, 2025
  • class yandex_cloud_ml_sdk._datasets.domain.AsyncDatasets
  • class yandex_cloud_ml_sdk._datasets.dataset.AsyncDataset
  • class yandex_cloud_ml_sdk._datasets.draft.AsyncDatasetDraft

class yandex_cloud_ml_sdk._datasets.domain.AsyncDatasets

This class provides methods to create and manage datasets of a specific type.

async get(dataset_id, *, timeout=60)

Fetch a dataset from the server using its ID.

Parameters

  • dataset_id (str) – the unique identifier of the dataset to fetch.
  • timeout (float) – the time to wait for the request. Defaults to 60 seconds.

Return type

AsyncDataset

async list(*, status=Undefined, name_pattern=Undefined, task_type=Undefined, timeout=60)

Fetch a list of datasets based on specified filters.

Parameters

  • status (str | Undefined | DatasetStatus | Iterable[str | DatasetStatus]) – the status filter for datasets; can be a single status or an iterable of statuses.
  • name_pattern (str | Undefined) – a pattern to filter dataset names.
  • task_type (str | Undefined | Iterable[str]) – the type of task associated with the datasets; can be a single task type or an iterable of task types.
  • timeout (float) – the time to wait for the request. Defaults to 60 seconds.

Return type

AsyncIterator[AsyncDataset]
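
The `get` and `list` methods above can be sketched as follows. This is a minimal usage sketch, not part of the reference: it assumes `AsyncYCloudML` is the async SDK entry point, that credentials are available from the environment, and the `"READY"` status value and placeholder IDs are illustrative only.

```python
import asyncio


async def show_datasets(folder_id: str) -> None:
    # Lazy import so the sketch reads cleanly even without the package:
    # pip install yandex-cloud-ml-sdk
    from yandex_cloud_ml_sdk import AsyncYCloudML

    sdk = AsyncYCloudML(folder_id=folder_id)

    # Fetch a single dataset by its ID.
    dataset = await sdk.datasets.get("<dataset-id>")
    print(dataset.name, dataset.status)

    # list() returns an async iterator, so results are fetched lazily.
    async for ds in sdk.datasets.list(status="READY", name_pattern="train"):
        print(ds.id, ds.rows, ds.size_bytes)


# To run (requires SDK credentials in the environment):
# asyncio.run(show_datasets("<folder-id>"))
```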

async list_upload_formats(task_type, *, timeout=60)

Fetch available upload formats for a specified task type.

Parameters

  • task_type (str) – the type of task for which to fetch upload formats.
  • timeout (float) – the time to wait for the request in seconds. Defaults to 60 seconds.

Return type

tuple[str, ...]

async list_upload_schemas(task_type, *, timeout=60)

Fetch available upload schemas for a specified task type.

Parameters

  • task_type (str) – the type of task for which to fetch upload schemas.
  • timeout (float) – the time to wait for the request in seconds. Defaults to 60 seconds.

Return type

tuple[DatasetUploadSchema, ...]

completions

a helper enabling autocompletion for text-to-text generation tasks

draft_from_path(path, *, task_type=Undefined, upload_format=Undefined, name=Undefined, description=Undefined, metadata=Undefined, labels=Undefined, allow_data_logging=Undefined)

Create a new dataset draft from a specified path.

Parameters

  • path (str | PathLike) – the path to the data file or directory.
  • task_type (str | Undefined) – the type of task for the dataset.
  • upload_format (str | Undefined) – the format in which the data should be uploaded.
  • name (str | Undefined) – the name of the dataset.
  • description (str | Undefined) – a description of the dataset.
  • metadata (str | Undefined) – metadata associated with the dataset.
  • labels (dict[str, str] | Undefined) – a set of labels for the dataset.
  • allow_data_logging (bool | Undefined) – a flag indicating if data logging is allowed.

Return type

DatasetDraftT
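
A sketch of `draft_from_path` usage, under the same assumptions as above (`AsyncYCloudML` entry point, environment credentials); the `task_type` and `upload_format` strings shown are illustrative values, not an exhaustive or authoritative list.

```python
import asyncio


async def make_draft(folder_id: str, path: str):
    from yandex_cloud_ml_sdk import AsyncYCloudML  # pip install yandex-cloud-ml-sdk

    sdk = AsyncYCloudML(folder_id=folder_id)
    # draft_from_path() is local: no request is sent to the server yet;
    # the draft only stores configuration for a later upload.
    return sdk.datasets.draft_from_path(
        path=path,
        task_type="TextToTextGeneration",  # assumed task-type string
        upload_format="jsonlines",         # assumed upload-format name
        name="tuning-data",
        labels={"team": "ml"},
        allow_data_logging=False,
    )


# asyncio.run(make_draft("<folder-id>", "train.jsonl"))
```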

text_classifiers_binary

a helper enabling autocompletion for binary text classification tasks

text_classifiers_multiclass

a helper enabling autocompletion for multiclass text classification tasks

text_classifiers_multilabel

a helper enabling autocompletion for multilabel text classification tasks

text_embeddings_pair

a helper enabling autocompletion for pairwise text embeddings tasks

text_embeddings_triplet

a helper enabling autocompletion for triplet text embeddings tasks

class yandex_cloud_ml_sdk._datasets.dataset.AsyncDataset

async update(*, name=Undefined, description=Undefined, labels=Undefined, timeout=60)

Updates the dataset with the provided parameters.

Parameters

  • name (str | Undefined) – the name for the dataset.
  • description (str | Undefined) – the description for the dataset.
  • labels (dict[str, str] | Undefined) – a set of labels for the dataset.
  • timeout (float) – the timeout, or the maximum time to wait for the update request. Defaults to 60 seconds.

Return type

Self

async delete(*, timeout=60)

Deletes the dataset.

Parameters

timeout (float) – the timeout, or maximum time to wait for the delete request. Defaults to 60 seconds.

Return type

None

async list_upload_formats(*, timeout=60)

Retrieve a list of upload formats for the dataset.

Parameters

timeout (float) – the maximum time to wait for the operation to complete. Defaults to 60 seconds.

Return type

tuple[str, ...]

async download(*, download_path, timeout=60, exist_ok=False, max_parallel_downloads=16)

Download a dataset to the specified path.

Parameters

  • download_path (str | PathLike) – the path where the dataset will be downloaded.
  • timeout (float) – the timeout, or maximum time to wait for the download. Defaults to 60 seconds.
  • exist_ok (bool) – if True, do not raise an error if files already exist. Defaults to False.
  • max_parallel_downloads (int) – the maximum number of concurrent downloads.

Return type

tuple[Path, ...]
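
The `download` method can be sketched like this; it is a hedged example that assumes `dataset` is an already-fetched `AsyncDataset`, and the parameter values shown are arbitrary illustrations.

```python
import asyncio


async def fetch_files(dataset, target_dir: str):
    # Downloads the dataset files into target_dir and returns the local paths.
    paths = await dataset.download(
        download_path=target_dir,
        exist_ok=True,             # tolerate files that already exist
        max_parallel_downloads=8,  # cap concurrent transfers
    )
    return paths


# paths = asyncio.run(fetch_files(dataset, "/tmp/my-dataset"))
```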

read(*, timeout=60, batch_size=Undefined)

Reads the dataset from the backend and yields its records one by one.

This method lazily loads records by chunks, minimizing memory usage for large datasets. The iterator yields dictionaries where keys are field names and values are parsed data.

Note

This method creates temporary files in the system’s default temporary directory during operation. To control the location of temporary files, refer to Python’s tempfile.gettempdir() documentation. Temporary files are automatically cleaned up after use.

Parameters

  • timeout (float) – Maximum time in seconds for both gRPC and HTTP operations. Includes connection establishment, data transfer, and processing time. Defaults to 60 seconds.
  • batch_size (int | Undefined) – the number of records to load into memory in one chunk. When Undefined (the default), the backend's optimal chunk size is used (typically corresponding to the layout of the distinct Parquet files in storage).

Yields

Dictionary representing single record with field-value pairs

Return type

AsyncIterator[dict[Any, Any]]
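
The lazy iteration described above can be sketched as follows, assuming `dataset` is an `AsyncDataset`; the `batch_size` value is an arbitrary illustration.

```python
import asyncio


async def count_records(dataset) -> int:
    # read() yields each record as a dict of field name -> parsed value,
    # loading at most batch_size records into memory at a time.
    total = 0
    async for _record in dataset.read(batch_size=1000):
        total += 1
    return total


# n = asyncio.run(count_records(dataset))
```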

folder_id: str

the ID of the folder which contains the dataset

name: str | None

the name of the dataset

description: str | None

a description of the dataset

metadata: str | None

metadata associated with the dataset

created_by: str

the user who created the dataset

created_at: datetime

the timestamp when the dataset was created

updated_at: datetime

the timestamp when the dataset was last updated

labels: dict[str, str] | None

a dictionary of labels associated with the dataset

allow_data_logging: bool

indicates if data logging is allowed for this dataset

status: DatasetStatus

the current status of the dataset

task_type: str

the type of task associated with the dataset

rows: int

the number of rows in the dataset

size_bytes: int

the size of the dataset in bytes

validation_errors: tuple[ValidationErrorInfo, ...]

a tuple of validation errors associated with the dataset

id: str

the ID of the dataset

class yandex_cloud_ml_sdk._datasets.draft.AsyncDatasetDraft

This class allows users to create a draft representation of a dataset without immediately interacting with the server. This draft serves as a structure for storing configuration settings, enabling users to edit the dataset’s properties before finalizing the upload.

async upload_deferred(*, timeout=60, upload_timeout=360, raise_on_validation_failure=True, chunk_size=104857600, parallelism=None)

Creates a dataset object on the server, uploads data to S3, triggers validation of the created dataset, and waits for its completion.

Parameters

  • timeout (float) – the time to wait for the dataset creation operation. Defaults to 60 seconds.
  • upload_timeout (float) – the time to wait for the upload operation. Defaults to 360 seconds.
  • raise_on_validation_failure (bool) – a flag indicating whether to raise an error if validation fails after the upload. Default is True.
  • chunk_size (int) – the size of chunks to use when uploading data. Default is defined by DEFAULT_CHUNK_SIZE.
  • parallelism (int | None) – the level of parallelism for the upload. Default is None, which means no limit.

Return type

AsyncOperation[AsyncDataset]

async upload(*, timeout=60, upload_timeout=360, raise_on_validation_failure=True, poll_timeout=21600, poll_interval=60, chunk_size=104857600, parallelism=None)

This method also performs the upload operation, but unlike upload_deferred, which returns an operation object, it directly returns the result of the completed operation.

Parameters

  • timeout (float) – the time to wait for the upload operation. Defaults to 60 seconds.
  • poll_timeout (int) – the time to wait for polling the operation status. Default is defined by DEFAULT_OPERATION_POLL_TIMEOUT.
  • poll_interval (float) – the interval at which to poll for the operation status. Defaults to 60 seconds.
  • kwargs – additional keyword arguments passed to upload_deferred.
  • upload_timeout (float)
  • raise_on_validation_failure (bool)
  • chunk_size (int)
  • parallelism (int | None)
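
Both upload flows can be sketched as below. This is a hedged example assuming `draft` is an `AsyncDatasetDraft`; the `poll_interval` value is illustrative, and the `wait()` call on the returned `AsyncOperation` is an assumption about the operation interface rather than part of this reference.

```python
import asyncio


async def upload_blocking(draft):
    # upload() wraps the deferred flow and polls until validation finishes,
    # returning the resulting AsyncDataset directly.
    return await draft.upload(poll_interval=30)


async def upload_in_background(draft):
    # upload_deferred() returns an AsyncOperation handle instead, so other
    # work can proceed before the result is needed.
    operation = await draft.upload_deferred()
    # Assumed: the operation exposes an awaitable wait() yielding the dataset.
    return await operation.wait()


# dataset = asyncio.run(upload_blocking(draft))
```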

allow_data_logging: bool | None = None

a flag indicating whether the dataset may be used to improve the quality of the models. Defaults to False.

configure(**kwargs)

Parameters

kwargs (Any)

Return type

Self

description: str | None = None

a description of the dataset

labels: dict[str, str] | None = None

labels for categorizing the dataset

metadata: str | None = None

metadata associated with the dataset

name: str | None = None

the name of the dataset

path: PathLike | None = None

the file path to the dataset

task_type: str | None = None

the type of task associated with the dataset

upload_format: str | None = None

the format in which the dataset will be uploaded

validate()

Return type

None

A Yandex project
© 2025 ООО «Яндекс.Облако»