

Working with datasets

Written by Yandex Cloud. Improved by Danila N. Updated on April 22, 2025.
  • Creating and initializing a dataset
    • Populating a dataset with files from the project storage or an attached object storage
    • Populating a dataset from a file available at the link
    • Populating a dataset with file storage objects
  • Enabling a dataset
  • Viewing a list of datasets available in the project
  • Sharing a dataset
  • Deactivating a dataset
  • Deleting a dataset

Use datasets when you need to work with large amounts of data in DataSphere. To perform key operations with datasets, run #pragma dataset utility commands from cell code.

Creating and initializing a dataset

To create and initialize a dataset, use a Bash or Python cell with the #pragma dataset init command. During initialization, specify a dataset name that is unique within the project and the size of the disk that will host the dataset. The naming requirements are as follows (a sample validity check follows the list):

  • It must be from 2 to 63 characters long.
  • It may contain lowercase Latin letters, numbers, and hyphens.
  • It must start with a letter and cannot end with a hyphen.
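
These rules map directly to a regular expression. Below is a minimal, hypothetical Python check; the is_valid_dataset_name helper is illustrative only and not part of the DataSphere API:

import re

# 2 to 63 characters: a lowercase Latin letter first, then lowercase
# letters, digits, or hyphens, ending with a letter or digit
NAME_RE = re.compile(r'^[a-z][a-z0-9-]{0,61}[a-z0-9]$')

def is_valid_dataset_name(name: str) -> bool:
    return bool(NAME_RE.fullmatch(name))

assert is_valid_dataset_name('cifar-10')
assert not is_valid_dataset_name('10-cifar')  # starts with a digit
assert not is_valid_dataset_name('cifar-')    # ends with a hyphen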

Note

The dataset initialization process allocates the entire requested disk space, though a portion of it is taken up by the file system. Specify the dataset size in GB, rounded up to the nearest integer: for example, request at least 2 GB for 1.2 GB of data.

Tip

The more disk space you allocate for a dataset, the higher the data read speed.

During initialization, the dataset is connected to the project once with read and write access. If the code in the dataset initialization cell runs successfully, the dataset is saved and reconnected with read-only access. If initialization fails, the dataset is disconnected and deleted.

After initialization is complete, you will not be able to change the data in the dataset. To update the data, delete the dataset and create a new one.

Populating a dataset with files from the project storage or an attached object storage

You can also use this method to create a dataset from data in a Yandex Object Storage bucket or another object storage attached to the project via an S3 connector.

Bash

To create a dataset called <dataset_name>, initialize it and copy files from the <source_folder_name> folder:

#!:bash
#pragma dataset init <dataset_name> --size 1Gb

set -e
cp -r <source_folder_name> /home/jupyter/mnt/datasets/<dataset_name>
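
If you prefer a Python cell, here is a minimal equivalent sketch using only the standard library (the placeholders are the same as above; dirs_exist_ok requires Python 3.8 or later):

#pragma dataset init <dataset_name> --size 1Gb

import shutil

# Copy the source folder into the mounted dataset directory;
# dirs_exist_ok=True allows writing into the existing mount point
shutil.copytree('<source_folder_name>', '/home/jupyter/mnt/datasets/<dataset_name>', dirs_exist_ok=True)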

Populating a dataset from a file available at the link

To create a dataset called <dataset_name> from a CIFAR-10 archive, run the following code from a cell:

Bash

#!:bash
#pragma dataset init <dataset_name> --size 1Gb

set -e
cd /home/jupyter/mnt/datasets/<dataset_name>
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvf cifar-10-python.tar.gz
rm -rf cifar-10-python.tar.gz

Python 3

#pragma dataset init <dataset_name> --size 1Gb

from urllib.request import urlopen
import tarfile
from os import remove

file_name = 'cifar-10-python.tar.gz'
file_url = 'https://www.cs.toronto.edu/~kriz/' + file_name
dest_dir = '/home/jupyter/mnt/datasets/<dataset_name>/'
dest_file = dest_dir + file_name

# Download the archive into the dataset directory
with urlopen(file_url) as i:
    with open(dest_file, 'w+b') as o:
        o.write(i.read())

# Unpack the archive, then delete it
tar = tarfile.open(dest_file)
tar.extractall(dest_dir)
tar.close()

remove(dest_file)

# This prevents serialization of temporary variables
del i, o, tar
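
As a quick sanity check, you can list the unpacked files. This assumes the standard CIFAR-10 archive layout, which extracts into a cifar-10-batches-py folder:

import os
print(os.listdir('/home/jupyter/mnt/datasets/<dataset_name>/cifar-10-batches-py'))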

Populating a dataset with file storage objects

Object Storage

If the Object Storage bucket is attached to the project via an S3 connector, you can create a dataset from its objects as if they were local files in your project. You can also download the objects directly with boto3, as in the following example:

#pragma dataset init <dataset_name> --size 1Gb

import os
import boto3

# Static key credentials are read from project secrets exposed
# as environment variables
S3_CREDS = {
    "aws_access_key_id": os.environ['<secret_with_access_key_ID>'],
    "aws_secret_access_key": os.environ['<secret_with_secret_key>']
}
bucket_name = "<bucket_name>"

source_path = ''
target_path = '/home/jupyter/mnt/datasets/<dataset_name>/'

s3r = boto3.resource(service_name='s3', endpoint_url='https://storage.yandexcloud.net', **S3_CREDS)
bucket = s3r.Bucket(bucket_name)

# Recreate the bucket's folder structure under target_path and download
# every object with the source_path prefix, skipping "folder" keys that
# end with '/'
for obj in bucket.objects.filter(Prefix=source_path):
    if not os.path.relpath(obj.key, source_path).startswith('../'):
        os.makedirs(os.path.join(target_path, os.path.dirname(obj.key)), exist_ok=True)
        if obj.key[-1] != '/':
            bucket.download_file(obj.key, os.path.join(target_path, obj.key))

Where:

  • aws_access_key_id: ID of the static access key generated for the project service account.
  • aws_secret_access_key: Secret key generated for that service account.

Yandex Disk

Initialize the dataset from a cell with the following code:

#pragma dataset init <dataset_name> --size 8Gb

import requests
from urllib.parse import urlencode
from io import BytesIO
from zipfile import ZipFile

base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?'
public_key = '<link_to_folder_on_Yandex_Disk>'

# Request a direct download link for the public folder
final_url = base_url + urlencode(dict(public_key=public_key))
response = requests.get(final_url)
download_url = response.json()['href']

# Download the folder as a ZIP archive and unpack it into the dataset
response = requests.get(download_url)

dist_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
zipfile = ZipFile(BytesIO(response.content))
zipfile.extractall(path=dist_path)

Where <link_to_folder_on_Yandex_Disk> is the link to the Yandex Disk folder whose contents you need to import to DataSphere.

Google Drive

  1. Install the gdown package:

    %pip install gdown
    
  2. Initialize the dataset from a cell with the following code:

    #pragma dataset init <dataset_name> --size 1Gb
    
    import gdown
    
    gdrive_folder_id = '<Google_Drive_folder_ID>'
    dst_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
    
    gdown.download_folder(id=gdrive_folder_id, output=dst_path, use_cookies=False)
    

    Where <Google_Drive_folder_ID> is the ID of the Google Drive folder specified in the URL after https://drive.google.com/drive/folders/. For example, in the https://drive.google.com/drive/folders/exampleId URL, the folder ID is exampleId.

Enabling a dataset

Once initialized, the dataset becomes enabled in the project and available at /home/jupyter/mnt/datasets/<dataset_name>. You can have up to three datasets enabled within a project at the same time. During your work, you can enable datasets whenever needed and disable them when the data is no longer in use.
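
Once enabled, a dataset's files can be read like any local files. A minimal sketch (pandas availability and the data.csv file name are assumptions for illustration):

import pandas as pd

# Enabled datasets are mounted read-only under /home/jupyter/mnt/datasets
df = pd.read_csv('/home/jupyter/mnt/datasets/<dataset_name>/data.csv')
df.head()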

To enable a dataset:

  1. Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.

  2. Under Project resources, click Dataset.
  3. In the line with the dataset in question, open the actions menu and select Activate.

Viewing a list of datasets available in the project

On the project page under Project resources, select Dataset. Next, go to the Shared tab.

Sharing a dataset

Note

You can only share resources within a single organization between communities created in the same availability zone.

  1. Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.

  2. Under Project resources, click Dataset.
  3. Select the dataset from the list.
  4. Go to the Access tab.
  5. Enable the visibility option next to the name of the community you want to share the dataset with.

To make a dataset available for use in another project, you need to add it in the Shared tab.

Deactivating a dataset

  1. Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.

  2. On the project page under Project resources, click Dataset.
  3. In the line with the dataset in question, open the actions menu and select Deactivate.

Deleting a dataset

You can delete a dataset only if it is:

  • Not enabled in the current project.
  • Not used in other projects.
  • Not available to the community.

To delete a dataset:

  1. On the project page under Project resources, click Dataset.
  2. In the line with the dataset in question, open the actions menu and select Delete.

Warning

The actual deletion of resources can take up to 72 hours.
