Working with datasets
Use datasets if you need to handle large amounts of data in DataSphere. To perform key operations with datasets, run #pragma dataset utility commands in code cells.
Creating and initializing a dataset
To create and initialize a dataset, use a Bash or Python cell with the #pragma dataset init command. When initializing, specify a dataset name that is unique within the project and the size of the disk that will host your dataset. The dataset name must meet the following requirements:
- It must be from 2 to 63 characters long.
- It can only contain lowercase Latin letters, numbers, and hyphens.
- It must start with a letter and cannot end with a hyphen.
Note
When you initialize a dataset, the system will allocate the entire requested disk space, though a portion will be used by the file system. Specify the dataset size in GB, rounded up to the nearest integer.
Tip
The larger the disk allocated for the dataset, the higher the data read speed.
During initialization, the dataset gets a one-time read and write connection to the project. If the code in the dataset initialization cell runs successfully, the dataset is saved and reconnected with read-only access permissions. If initialization fails, the dataset will be disconnected and deleted.
After initialization is complete, you will not be able to change the data in the dataset. If you need to update the data, delete the dataset and create a new one.
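For example, a minimal initialization cell might look like this; the sample.txt file and its contents are purely illustrative:
#pragma dataset init <dataset_name> --size 1Gb
# The dataset is writable only while this cell is running;
# once the cell completes successfully, it becomes read-only.
with open('/home/jupyter/mnt/datasets/<dataset_name>/sample.txt', 'w') as f:
    f.write('sample data')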
Populating a dataset with files from the project storage or an attached object storage
This method also works for data stored in a Yandex Object Storage bucket or another object storage attached to your project storage via an S3 connector.
To create a dataset called <dataset_name>, initialize it and copy files from the <source_folder_name> folder:
#!:bash
#pragma dataset init <dataset_name> --size 1Gb
set -e
cp -r <source_folder_name> /home/jupyter/mnt/datasets/<dataset_name>
Populating a dataset from a file available by a link
To create a dataset called <dataset_name> from the CIFAR-10 archive available by a link, initialize the dataset, then download and unpack the archive:
#!:bash
#pragma dataset init <dataset_name> --size 1Gb
set -e
cd /home/jupyter/mnt/datasets/<dataset_name>
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvf cifar-10-python.tar.gz
rm -rf cifar-10-python.tar.gz
Alternatively, do the same in a Python cell:
#pragma dataset init <dataset_name> --size 1Gb
from urllib.request import urlopen
import tarfile
from os import remove

file_name = 'cifar-10-python.tar.gz'
file_url = 'https://www.cs.toronto.edu/~kriz/' + file_name
dest_dir = '/home/jupyter/mnt/datasets/<dataset_name>/'
dest_file = dest_dir + file_name

# Download the archive into the dataset
with urlopen(file_url) as i:
    with open(dest_file, 'w+b') as o:
        o.write(i.read())

# Unpack the archive and delete it
tar = tarfile.open(dest_file)
tar.extractall(dest_dir)
tar.close()
remove(dest_file)

# This prevents serialization of temporary variables
del i, o, tar
Populating a dataset with file storage objects
If the Object Storage bucket is attached to your project via an S3 connector, you can create a dataset from its objects in the same way as from local files in your project. You can also copy objects using the boto3 library:
#pragma dataset init <dataset_name> --size 1Gb
import os
import boto3

S3_CREDS = {
    "aws_access_key_id": os.environ['<secret_with_access_key_ID>'],
    "aws_secret_access_key": os.environ['<secret_with_secret_key>']
}
bucket_name = "<bucket_name>"

source_path = ''
target_path = '/home/jupyter/mnt/datasets/<dataset_name>/'

s3r = boto3.resource(service_name='s3', endpoint_url='https://storage.yandexcloud.net', **S3_CREDS)
bucket = s3r.Bucket(bucket_name)
for obj in bucket.objects.filter(Prefix=source_path):
    # Skip keys that resolve outside the source path
    if not os.path.relpath(obj.key, source_path).startswith('../'):
        os.makedirs(os.path.join(target_path, os.path.dirname(obj.key)), exist_ok=True)
        # Download everything except "folder" placeholder keys
        if obj.key[-1] != '/':
            bucket.download_file(obj.key, os.path.join(target_path, obj.key))
Where:
- aws_access_key_id: ID of the static access key generated for the project service account.
- aws_secret_access_key: Secret key generated for that service account.
Populating a dataset with files from Yandex Disk
Initialize the dataset in a code cell:
#pragma dataset init <dataset_name> --size 8Gb
import requests
from urllib.parse import urlencode
from io import BytesIO
from zipfile import ZipFile
base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?'
public_key = '<link_to_folder_on_Yandex_Disk>'
final_url = base_url + urlencode(dict(public_key=public_key))
response = requests.get(final_url)
download_url = response.json()['href']
response = requests.get(download_url)
dist_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
zipfile = ZipFile(BytesIO(response.content))
zipfile.extractall(path=dist_path)
Where <link_to_folder_on_Yandex_Disk> is the link to the Yandex Disk folder whose contents you need to import to DataSphere.
Populating a dataset with files from Google Drive
- Install the gdown package:
%pip install gdown
- Initialize the dataset in a code cell:
#pragma dataset init <dataset_name> --size 1Gb
import gdown

gdrive_folder_id = '<Google_Drive_folder_ID>'
dst_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
gdown.download_folder(id=gdrive_folder_id, output=dst_path, use_cookies=False)
Where <Google_Drive_folder_ID> is the Google Drive folder ID specified in the URL after https://drive.google.com/drive/folders/. For example, in the https://drive.google.com/drive/folders/exampleId URL, the folder ID is exampleId.
Activating a dataset
Once initialized, the dataset becomes active within the project and available at /home/jupyter/mnt/datasets/<dataset_name>. You can have up to three active datasets within a project at the same time. During your work, you can activate datasets whenever needed and disable them when the data is no longer in use.
To activate a dataset:
- Select the project in your community or on the DataSphere home page in the Recent projects tab.
- Under Project resources, click Dataset.
- Next to the dataset in question, click ⋮ and select Activate.
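Once a dataset is activated, you can read its files directly from the mount point. Below is a minimal sketch; the data.csv file name is a hypothetical example, and pandas is assumed to be available in the environment:
import pandas as pd

# Read a file from the activated, read-only dataset mount
df = pd.read_csv('/home/jupyter/mnt/datasets/<dataset_name>/data.csv')
print(df.head())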
Viewing a list of datasets available in the project
On the project page under Project resources, select Dataset.
Sharing a dataset
Note
You can only share resources within a single organization, between communities created in the same availability zone.
- Select the project in your community or on the DataSphere home page in the Recent projects tab.
- Under Project resources, click Dataset.
- Select the dataset from the list.
- Go to the Access tab.
- Enable the visibility option next to the name of the community you want to share the dataset with.
To make a dataset available for use in another project, you need to add that dataset in the Shared tab.
Disabling a dataset
- Select the project in your community or on the DataSphere home page in the Recent projects tab.
- On the project page under Project resources, click Dataset.
- Next to the dataset in question, click ⋮ and select Deactivate.
Deleting a dataset
You can delete a dataset only if it is:
- Not activated for the current project.
- Not used in other projects.
- Unavailable to the community.
To delete a dataset:
- On the project page under Project resources, click Dataset.
- Next to the dataset in question, click ⋮ and select Delete.
Warning
The actual deletion of resources may take up to 72 hours.