Working with datasets
Use datasets if you need to handle large amounts of data in DataSphere. For key operations with datasets, run #pragma dataset utility commands from the cell code.
Create and initialize a dataset
To create and initialize a dataset, use a Bash or Python cell with the #pragma dataset init command. At initialization, specify a project-unique dataset name and the size of the disk to host the dataset. The dataset naming requirements are as follows:
- It must be 2 to 63 characters long.
- It may contain lowercase Latin letters, numbers, and hyphens.
- It must start with a letter and cannot end with a hyphen.
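The naming rules above can be checked before running an initialization cell. The following is a minimal sketch: the function name is_valid_dataset_name is hypothetical and not part of DataSphere, and the regular expression only encodes the three rules as listed here.

```python
import re

# Encodes the rules listed above: 2 to 63 characters, lowercase Latin letters,
# digits, and hyphens, starting with a letter and not ending with a hyphen.
_DATASET_NAME_RE = re.compile(r"[a-z][a-z0-9-]{0,61}[a-z0-9]")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` satisfies the listed naming requirements."""
    return _DATASET_NAME_RE.fullmatch(name) is not None
```

For example, is_valid_dataset_name("cifar-10") returns True, while names such as "Data", "1data", or "data-" are rejected.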
The dataset initialization process will allocate the entire requested amount of disk space but the file system will take up a part of this space. Specify the dataset size in GB rounded up to the nearest integer.
The larger the disk allocated for the dataset, the higher the data read speed.
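To pick a value for --size, it can help to measure the source data first and round up. This is a rough sketch under assumptions: GB is treated as 2**30 bytes, and fs_overhead is a hypothetical safety margin for the space taken by the file system, since the actual overhead is not stated here. Both helper names are illustrative.

```python
import math
import os

BYTES_PER_GB = 1024 ** 3  # assumption: "GB" is treated as 2**30 bytes


def folder_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def dataset_size_gb(data_bytes: int, fs_overhead: float = 0.1) -> int:
    """Round the required size up to a whole number of GB (at least 1).

    `fs_overhead` is a hypothetical margin for file system usage.
    """
    required = data_bytes * (1 + fs_overhead)
    return max(1, math.ceil(required / BYTES_PER_GB))
```

A call such as dataset_size_gb(folder_size_bytes('<source_folder_name>')) then gives a whole number of GB to use in the #pragma dataset init command.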
During initialization, the dataset makes a one-time connection to the project with read and write access. If the code in the dataset initialization cell executes successfully, the dataset is saved and reconnected with read-only access. If an error occurs during initialization, the dataset will be disconnected and deleted.
After initialization, dataset data cannot be changed. If the data needs to be updated, delete the dataset and create a new one.
Populate the dataset with files from the project storage or an attached object storage
This method also works for creating a dataset from data in a Yandex Object Storage bucket or another object storage attached to the project via an S3 connector.
To create the <dataset_name> dataset, initialize it and copy files from the <source_folder_name> folder:
#pragma dataset init <dataset_name> --size 1Gb
set -e
cp -r <source_folder_name> /home/jupyter/mnt/datasets/<dataset_name>
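Since initialization also works from a Python cell, the same copy step could be written in Python. This is a minimal sketch: copy_folder_into_dataset is a hypothetical helper, and it assumes the dataset directory already exists, in which case cp -r places the source folder inside it. In a real cell, the first line would be the #pragma dataset init command.

```python
import os
import shutil


def copy_folder_into_dataset(source_dir: str, dataset_dir: str) -> str:
    """Copy `source_dir` into `dataset_dir`, like `cp -r source_dir dataset_dir`.

    When the target directory already exists, `cp -r` puts the source folder
    inside it, so the copy ends up at dataset_dir/<source folder name>.
    """
    folder_name = os.path.basename(os.path.normpath(source_dir))
    destination = os.path.join(dataset_dir, folder_name)
    # copytree raises FileExistsError if `destination` is already present.
    shutil.copytree(source_dir, destination)
    return destination
```

For example, copy_folder_into_dataset('<source_folder_name>', '/home/jupyter/mnt/datasets/<dataset_name>') mirrors the Bash command above.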
Populate the dataset from a file by a link
To create a dataset named <dataset_name> from a CIFAR-10 archive, run the following code in a Bash or Python cell:
#pragma dataset init <dataset_name> --size 1Gb
set -e
cd /home/jupyter/mnt/datasets/<dataset_name>
# Download the archive here; <archive_URL> is a placeholder for its link
wget <archive_URL>
tar -xvf cifar-10-python.tar.gz
rm -rf cifar-10-python.tar.gz
#pragma dataset init <dataset_name> --size 1Gb
from urllib.request import urlopen
import tarfile
from os import remove
file_name = 'cifar-10-python.tar.gz'
file_url = '' + file_name
dest_dir = '/home/jupyter/mnt/datasets/<dataset_name>/'
dest_file = dest_dir + file_name
# Download the archive into the dataset directory
with urlopen(file_url) as i:
    with open(dest_file, 'w+b') as o:
        o.write(i.read())
# Unpack the archive, then delete it
tar = tarfile.open(dest_file)
tar.extractall(dest_dir)
tar.close()
remove(dest_file)
# This prevents serialization of temporary variables
del i, o, tar
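For large archives, the download can be streamed to disk in chunks rather than held in memory. This is a minimal sketch, not the documented method: fetch_and_extract is a hypothetical helper, and its arguments stand in for the file_url, dest_dir, and file_name values from the cell above. Because the temporary objects are local to the function, no del statement is needed afterwards.

```python
import shutil
import tarfile
from os import remove
from os.path import join
from urllib.request import urlopen


def fetch_and_extract(file_url: str, dest_dir: str, file_name: str) -> None:
    """Download a tar archive to `dest_dir`, unpack it there, and delete the archive."""
    dest_file = join(dest_dir, file_name)
    # Stream the response to disk in chunks instead of reading it all at once.
    with urlopen(file_url) as response, open(dest_file, 'wb') as out:
        shutil.copyfileobj(response, out)
    # Unpack next to the archive, then remove the archive to free disk space.
    with tarfile.open(dest_file) as tar:
        tar.extractall(dest_dir)
    remove(dest_file)
```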
Populate a dataset from file storage objects
If you are connecting to an Object Storage bucket using an S3 connector, create a dataset from objects as if they were local files in your project.
#pragma dataset init <dataset_name> --size 1Gb
import os
import boto3
# Static access key values are read from project secrets
S3_CREDS = {
    "aws_access_key_id": os.environ['<secret_with_access_key_ID>'],
    "aws_secret_access_key": os.environ['<secret_with_secret_key>']
}
bucket_name = "<bucket_name>"
source_path = ''
target_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
s3r = boto3.resource(service_name='s3', endpoint_url='', **S3_CREDS)
bucket = s3r.Bucket(bucket_name)
for obj in bucket.objects.filter(Prefix=source_path):
    # Keep only keys located inside source_path
    if not os.path.relpath(obj.key, source_path).startswith('../'):
        os.makedirs(os.path.join(target_path, os.path.dirname(obj.key)), exist_ok=True)
        # Keys ending with '/' are folder placeholders, not files
        if obj.key[-1] != '/':
            bucket.download_file(obj.key, os.path.join(target_path, obj.key))
Where:
- aws_access_key_id: ID of the static access key generated for the project service account.
- aws_secret_access_key: Secret key generated for that service account.
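The two checks inside the download loop are easier to follow in isolation. This sketch reproduces them as a pure function with no network access. local_path_for_key is a hypothetical name, and POSIX-style paths are assumed, as in the /home/jupyter paths used here.

```python
import os
from typing import Optional


def local_path_for_key(key: str, source_path: str, target_path: str) -> Optional[str]:
    """Return where `key` would be saved, or None if the loop would skip it."""
    # Prefix filtering is a plain string match, so 'data' also matches
    # 'data-old/x'. The relpath check keeps only keys inside source_path.
    if os.path.relpath(key, source_path).startswith('../'):
        return None
    # Keys ending with '/' are folder placeholders, not downloadable objects.
    if key.endswith('/'):
        return None
    return os.path.join(target_path, key)
```

With source_path set to 'data', the key 'data/a.txt' maps to a path under target_path, while 'data-old/a.txt' and 'data/sub/' are skipped.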
Populate a dataset from Yandex Disk objects
Initialize the dataset in a cell with the following code:
#pragma dataset init <dataset_name> --size 8Gb
import requests
from urllib.parse import urlencode
from io import BytesIO
from zipfile import ZipFile
base_url = ''
public_key = '<Yandex_Disk_folder_link>'
# Request a direct download link for the public folder
final_url = base_url + urlencode(dict(public_key=public_key))
response = requests.get(final_url)
download_url = response.json()['href']
# Download the folder as a ZIP archive and unpack it into the dataset
response = requests.get(download_url)
dist_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
zipfile = ZipFile(BytesIO(response.content))
zipfile.extractall(dist_path)
Where <Yandex_Disk_folder_link> is the URL of the folder on Yandex Disk whose contents need to be imported to DataSphere.
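The cell above holds the downloaded folder in memory as a ZIP archive (response.content). Unpacking such an archive into the dataset directory could look like this minimal sketch. extract_zip_bytes is a hypothetical helper, and it returns the member names so the result can be checked.

```python
from io import BytesIO
from zipfile import ZipFile


def extract_zip_bytes(content: bytes, dist_path: str) -> list:
    """Unpack an in-memory ZIP archive into `dist_path` and return the member names."""
    with ZipFile(BytesIO(content)) as archive:
        archive.extractall(dist_path)
        return archive.namelist()
```

In the cell above, this would be called as extract_zip_bytes(response.content, dist_path).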
Populate a dataset from Google Drive objects
Install the gdown package:
%pip install gdown
Initialize the dataset in a cell with the following code:
#pragma dataset init <dataset_name> --size 1Gb
import gdown
gdrive_folder_id = '<Google_Drive_folder_ID>'
dst_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
gdown.download_folder(id=gdrive_folder_id, output=dst_path, use_cookies=False)
Where <Google_Drive_folder_ID> is the ID of the Google Drive folder as it appears in the folder URL, for example, exampleId.
Enabling a dataset
Once initialized, the dataset becomes enabled in the project and available at the path /home/jupyter/mnt/datasets/<dataset_name>. You can have up to three datasets activated in a project at the same time. During your work, you can activate datasets whenever needed and disable them when you no longer need the data.
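From a notebook cell, one way to see which datasets are currently mounted is to list the directories under the mount path. This sketch assumes that every enabled dataset appears as a subdirectory of /home/jupyter/mnt/datasets and that nothing else is stored there. list_enabled_datasets is a hypothetical helper, not a DataSphere API.

```python
import os

DATASETS_ROOT = '/home/jupyter/mnt/datasets'  # mount path given in the text above


def list_enabled_datasets(root: str = DATASETS_ROOT) -> list:
    """Return the sorted names of dataset directories visible under `root`."""
    if not os.path.isdir(root):
        return []
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
```

Under the limit stated above, the returned list would be expected to hold at most three names.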
To enable a dataset:
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- Under Project resources, click Dataset.
- In the line with the appropriate dataset, click the menu icon and select Activate.
Viewing a list of datasets available in the project
On the project page under Project resources, select Dataset.
Sharing a dataset
You can only share resources within a single organization between communities created in the same availability zone.
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- Under Project resources, click Dataset.
- Select the appropriate dataset from the list.
- Go to the Access tab.
- Enable the visibility option next to the name of the community you want to share the dataset with.
To make a dataset available for use in another project, you should add it to the Shared tab.
Deactivating a dataset
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- On the project page under Project resources, click Dataset.
- In the line with the appropriate dataset, click the menu icon and select Deactivate.
Deleting a dataset
You can delete a dataset only if it is:
- Not enabled in any of your projects.
- Not used in other projects.
- Unavailable to the community.
To delete a dataset:
- On the project page under Project resources, click Dataset.
- In the line with the appropriate dataset, click the menu icon and select Delete.
The actual deletion of resources can take up to 72 hours.