

Datasets domain

Created by Yandex Cloud
Updated November 7, 2025
  • class yandex_cloud_ml_sdk._datasets.domain.AsyncDatasets
  • class yandex_cloud_ml_sdk._datasets.dataset.AsyncDataset
  • class yandex_cloud_ml_sdk._datasets.draft.AsyncDatasetDraft

class yandex_cloud_ml_sdk._datasets.domain.AsyncDatasets

This class provides methods to create and manage datasets of a specific type.

async get(dataset_id, *, timeout=60)

Fetch a dataset from the server using its ID.

Parameters

  • dataset_id (str) – the unique identifier of the dataset to fetch.
  • timeout (float) – the time to wait for the request. Defaults to 60 seconds.

Return type

AsyncDataset

async list(*, status=Undefined, name_pattern=Undefined, task_type=Undefined, timeout=60)

Fetch a list of datasets based on specified filters.

Parameters

  • status (str | Undefined | DatasetStatus | Iterable[str | DatasetStatus]) – the status filter for datasets; can be a single status or an iterable of statuses.
  • name_pattern (str | Undefined) – a pattern to filter dataset names.
  • task_type (str | Undefined | Iterable[str]) – the type of task associated with the datasets; can be a single task type or an iterable of task types.
  • timeout (float) – the time to wait for the request. Defaults to 60 seconds.

Return type

AsyncIterator[AsyncDataset]
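
The `get` and `list` methods above can be sketched as follows. This is a minimal usage sketch, not part of the reference: it assumes `AsyncYCloudML` is the async SDK entry point, that credentials are available from the environment, and the `"READY"` status value and placeholder IDs are illustrative only.

```python
import asyncio


async def show_datasets(folder_id: str) -> None:
    # Lazy import so the sketch reads cleanly even without the package:
    # pip install yandex-cloud-ml-sdk
    from yandex_cloud_ml_sdk import AsyncYCloudML

    sdk = AsyncYCloudML(folder_id=folder_id)

    # Fetch a single dataset by its ID.
    dataset = await sdk.datasets.get("<dataset-id>")
    print(dataset.name, dataset.status)

    # list() returns an async iterator, so results are fetched lazily.
    async for ds in sdk.datasets.list(status="READY", name_pattern="train"):
        print(ds.id, ds.rows, ds.size_bytes)


# To run (requires SDK credentials in the environment):
# asyncio.run(show_datasets("<folder-id>"))
```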

async list_upload_formats(task_type, *, timeout=60)

Fetch available upload formats for a specified task type.

Parameters

  • task_type (str) – the type of task for which to fetch upload formats.
  • timeout (float) – the time to wait for the request in seconds. Defaults to 60 seconds.

Return type

tuple[str, ...]

async list_upload_schemas(task_type, *, timeout=60)

Fetch available upload schemas for a specified task type.

Parameters

  • task_type (str) – the type of task for which to fetch upload schemas.
  • timeout (float) – the time to wait for the request in seconds. Defaults to 60 seconds.

Return type

tuple[DatasetUploadSchema, ...]

completions

a helper enabling autocompletion for text-to-text generation tasks

draft_from_path(path, *, task_type=Undefined, upload_format=Undefined, name=Undefined, description=Undefined, metadata=Undefined, labels=Undefined, allow_data_logging=Undefined)

Create a new dataset draft from a specified path.

Parameters

  • path (str | PathLike) – the path to the data file or directory.
  • task_type (str | Undefined) – the type of task for the dataset.
  • upload_format (str | Undefined) – the format in which the data should be uploaded.
  • name (str | Undefined) – the name of the dataset.
  • description (str | Undefined) – a description of the dataset.
  • metadata (str | Undefined) – metadata associated with the dataset.
  • labels (dict[str, str] | Undefined) – a set of labels for the dataset.
  • allow_data_logging (bool | Undefined) – a flag indicating if data logging is allowed.

Return type

DatasetDraftT
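
A sketch of `draft_from_path` usage, under the same assumptions as above (`AsyncYCloudML` entry point, environment credentials); the `task_type` and `upload_format` strings shown are illustrative values, not an exhaustive or authoritative list.

```python
import asyncio


async def make_draft(folder_id: str, path: str):
    from yandex_cloud_ml_sdk import AsyncYCloudML  # pip install yandex-cloud-ml-sdk

    sdk = AsyncYCloudML(folder_id=folder_id)
    # draft_from_path() is local: no request is sent to the server yet;
    # the draft only stores configuration for a later upload.
    return sdk.datasets.draft_from_path(
        path=path,
        task_type="TextToTextGeneration",  # assumed task-type string
        upload_format="jsonlines",         # assumed upload-format name
        name="tuning-data",
        labels={"team": "ml"},
        allow_data_logging=False,
    )


# asyncio.run(make_draft("<folder-id>", "train.jsonl"))
```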

text_classifiers_binary

a helper enabling autocompletion for binary text classification tasks

text_classifiers_multiclass

a helper enabling autocompletion for multiclass text classification tasks

text_classifiers_multilabel

a helper enabling autocompletion for multilabel text classification tasks

text_embeddings_pair

a helper enabling autocompletion for pairwise text embeddings tasks

text_embeddings_triplet

a helper enabling autocompletion for triplet text embeddings tasks

class yandex_cloud_ml_sdk._datasets.dataset.AsyncDataset

async update(*, name=Undefined, description=Undefined, labels=Undefined, timeout=60)

Updates the dataset with the provided parameters.

Parameters

  • name (str | Undefined) – the name for the dataset.
  • description (str | Undefined) – the description for the dataset.
  • labels (dict[str, str] | Undefined) – a set of labels for the dataset.
  • timeout (float) – the timeout, or the maximum time to wait for the update request. Defaults to 60 seconds.

Return type

Self

async delete(*, timeout=60)

Deletes the dataset.

Parameters

timeout (float) – the timeout, or maximum time to wait for the delete request. Defaults to 60 seconds.

Return type

None

async list_upload_formats(*, timeout=60)

Retrieve a list of upload formats for the dataset.

Parameters

timeout (float) – the maximum time to wait for the operation to complete. Defaults to 60 seconds.

Return type

tuple[str, ...]

async download(*, download_path, timeout=60, exist_ok=False, max_parallel_downloads=16)

Download a dataset to the specified path.

Parameters

  • download_path (str | PathLike) – the path where the dataset will be downloaded.
  • timeout (float) – the timeout, or maximum time to wait for the download. Defaults to 60 seconds.
  • exist_ok (bool) – if True, do not raise an error if files already exist. Defaults to False.
  • max_parallel_downloads (int) – the maximum number of concurrent downloads.

Return type

tuple[Path, ...]
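
The `download` method can be sketched like this; it is a hedged example that assumes `dataset` is an already-fetched `AsyncDataset`, and the parameter values shown are arbitrary illustrations.

```python
import asyncio


async def fetch_files(dataset, target_dir: str):
    # Downloads the dataset files into target_dir and returns the local paths.
    paths = await dataset.download(
        download_path=target_dir,
        exist_ok=True,             # tolerate files that already exist
        max_parallel_downloads=8,  # cap concurrent transfers
    )
    return paths


# paths = asyncio.run(fetch_files(dataset, "/tmp/my-dataset"))
```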

read(*, timeout=60, batch_size=Undefined)

Reads the dataset from the backend and yields its records one by one.

This method lazily loads records by chunks, minimizing memory usage for large datasets. The iterator yields dictionaries where keys are field names and values are parsed data.

Note

This method creates temporary files in the system’s default temporary directory during operation. To control the location of temporary files, refer to Python’s tempfile.gettempdir() documentation. Temporary files are automatically cleaned up after use.

Parameters

  • timeout (float) – Maximum time in seconds for both gRPC and HTTP operations. Includes connection establishment, data transfer, and processing time. Defaults to 60 seconds.
  • batch_size (int | Undefined) – the number of records to load into memory in one chunk. When Undefined (the default), the backend's optimal chunk size is used (typically corresponding to the layout of the distinct Parquet files in storage).

Yields

Dictionary representing single record with field-value pairs

Return type

AsyncIterator[dict[Any, Any]]
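
The lazy iteration described above can be sketched as follows, assuming `dataset` is an `AsyncDataset`; the `batch_size` value is an arbitrary illustration.

```python
import asyncio


async def count_records(dataset) -> int:
    # read() yields each record as a dict of field name -> parsed value,
    # loading at most batch_size records into memory at a time.
    total = 0
    async for _record in dataset.read(batch_size=1000):
        total += 1
    return total


# n = asyncio.run(count_records(dataset))
```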

folder_id: str

the ID of the folder which contains the dataset

name: str | None

the name of the dataset

description: str | None

a description of the dataset

metadata: str | None

metadata associated with the dataset

created_by: str

the user who created the dataset

created_at: datetime

the timestamp when the dataset was created

updated_at: datetime

the timestamp when the dataset was last updated

labels: dict[str, str] | None

a dictionary of labels associated with the dataset

allow_data_logging: bool

indicates if data logging is allowed for this dataset

status: DatasetStatus

the current status of the dataset

task_type: str

the type of task associated with the dataset

rows: int

the number of rows in the dataset

size_bytes: int

the size of the dataset in bytes

validation_errors: tuple[ValidationErrorInfo, ...]

a tuple of validation errors associated with the dataset

id: str

the ID of the dataset

class yandex_cloud_ml_sdk._datasets.draft.AsyncDatasetDraft

This class allows users to create a draft representation of a dataset without immediately interacting with the server. This draft serves as a structure for storing configuration settings, enabling users to edit the dataset’s properties before finalizing the upload.

async upload_deferred(*, timeout=60, upload_timeout=360, raise_on_validation_failure=True, chunk_size=104857600, parallelism=None)

Creates a dataset object on the server, uploads data to S3, triggers validation of the created dataset, and waits for its completion.

Parameters

  • timeout (float) – the time to wait for the dataset creation operation. Defaults to 60 seconds.
  • upload_timeout (float) – the time to wait for the upload operation. Defaults to 360 seconds.
  • raise_on_validation_failure (bool) – a flag indicating whether to raise an error if validation fails after the upload. Default is True.
  • chunk_size (int) – the size of chunks to use when uploading data. Default is defined by DEFAULT_CHUNK_SIZE.
  • parallelism (int | None) – the level of parallelism for the upload. Default is None, which means no limit.

Return type

AsyncOperation[AsyncDataset]

async upload(*, timeout=60, upload_timeout=360, raise_on_validation_failure=True, poll_timeout=21600, poll_interval=60, chunk_size=104857600, parallelism=None)

This method also performs the upload operation, but unlike upload_deferred, which returns an operation object, it directly returns the result of the completed operation.

Parameters

  • timeout (float) – the time to wait for the upload operation. Defaults to 60 seconds.
  • poll_timeout (int) – the time to wait for polling the operation status. Default is defined by DEFAULT_OPERATION_POLL_TIMEOUT.
  • poll_interval (float) – the interval at which to poll for the operation status. Defaults to 60 seconds.
  • kwargs – additional keyword arguments passed to upload_deferred.
  • upload_timeout (float)
  • raise_on_validation_failure (bool)
  • chunk_size (int)
  • parallelism (int | None)
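
Both upload flows can be sketched as below. This is a hedged example assuming `draft` is an `AsyncDatasetDraft`; the `poll_interval` value is illustrative, and the `wait()` call on the returned `AsyncOperation` is an assumption about the operation interface rather than part of this reference.

```python
import asyncio


async def upload_blocking(draft):
    # upload() wraps the deferred flow and polls until validation finishes,
    # returning the resulting AsyncDataset directly.
    return await draft.upload(poll_interval=30)


async def upload_in_background(draft):
    # upload_deferred() returns an AsyncOperation handle instead, so other
    # work can proceed before the result is needed.
    operation = await draft.upload_deferred()
    # Assumed: the operation exposes an awaitable wait() yielding the dataset.
    return await operation.wait()


# dataset = asyncio.run(upload_blocking(draft))
```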

allow_data_logging: bool | None = None

a flag indicating whether the dataset may be used to improve the quality of the models. Defaults to False.

configure(**kwargs)

Parameters

kwargs (Any)

Return type

Self

description: str | None = None

a description of the dataset

labels: dict[str, str] | None = None

labels for categorizing the dataset

metadata: str | None = None

metadata associated with the dataset

name: str | None = None

the name of the dataset

path: PathLike | None = None

the file path to the dataset

task_type: str | None = None

the type of task associated with the dataset

upload_format: str | None = None

the format in which the dataset will be uploaded

validate()

Return type

None

A Yandex project
© 2025 ООО «Яндекс.Облако»