Working with datasets
Use datasets if you need to handle large amounts of data in DataSphere. For key operations with datasets, run #pragma dataset utility commands from cell code.
Creating and initializing a dataset
To create and initialize a dataset, use a Bash or Python cell with the #pragma dataset init
command. When initializing, specify a project-unique dataset name and the size of the disk to host the dataset. The dataset naming requirements are as follows:
- It must be 2 to 63 characters long.
- It may contain lowercase Latin letters, numbers, and hyphens.
- It must start with a letter and cannot end with a hyphen.
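If you generate dataset names in code, you can pre-check candidates against these rules. Below is a minimal sketch using Python's re module; the helper name is ours, not a DataSphere API:
import re

# 2 to 63 chars, lowercase Latin letters, digits, and hyphens;
# starts with a letter and does not end with a hyphen.
DATASET_NAME_RE = re.compile(r'[a-z][a-z0-9-]{0,61}[a-z0-9]')

def is_valid_dataset_name(name: str) -> bool:
    return bool(DATASET_NAME_RE.fullmatch(name))

print(is_valid_dataset_name('cifar-10'))  # True
print(is_valid_dataset_name('data-'))     # False: ends with a hyphen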
Note
The dataset initialization process will allocate the entire requested disk space, though a portion will be used by the file system. Specify the dataset size in GB rounded up to the nearest integer.
Tip
The more disk space you allocate for a dataset, the higher the data read speed.
During initialization, the dataset is attached to the project once with read and write access. If the code in the dataset initialization cell runs successfully, the dataset is saved and reattached with read-only access. If initialization fails, the dataset is detached and deleted.
After initialization is complete, you will not be able to change the data in the dataset. To update the data, delete the dataset and create a new one.
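In practice, this means all writes must happen inside the initialization cell itself. A minimal sketch (the file name and contents are illustrative):
#pragma dataset init <dataset_name> --size 1Gb
import pandas as pd

# The dataset is writable only while this cell runs; after successful
# initialization it is reattached read-only.
df = pd.DataFrame({'label': [0, 1], 'value': [3.14, 2.71]})
df.to_csv('/home/jupyter/mnt/datasets/<dataset_name>/data.csv', index=False)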
Populating a dataset with files from the project storage or an attached object storage
You can also use this method to create a dataset with data from a Yandex Object Storage bucket or another object storage attached to the project via an S3 connector.
To create a dataset called <dataset_name>, initialize it and copy files from the <source_folder_name> folder:
#!:bash
#pragma dataset init <dataset_name> --size 1Gb
set -e
cp -r <source_folder_name> /home/jupyter/mnt/datasets/<dataset_name>
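The same copy can be done from a Python cell; a sketch using the standard shutil module (dirs_exist_ok requires Python 3.8+):
#pragma dataset init <dataset_name> --size 1Gb
import shutil

# Copy the project folder into the mounted dataset directory,
# which already exists after initialization.
shutil.copytree('<source_folder_name>',
                '/home/jupyter/mnt/datasets/<dataset_name>',
                dirs_exist_ok=True)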
Populating a dataset from a file available via a link
To create a dataset called <dataset_name> from the CIFAR-10 archive, initialize the dataset, then download and unpack the file. In a Bash cell:
#!:bash
#pragma dataset init <dataset_name> --size 1Gb
set -e
cd /home/jupyter/mnt/datasets/<dataset_name>
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvf cifar-10-python.tar.gz
rm -rf cifar-10-python.tar.gz
The same in a Python cell:
#pragma dataset init <dataset_name> --size 1Gb
from urllib.request import urlopen
import tarfile
from os import remove
file_name = 'cifar-10-python.tar.gz'
file_url = 'https://www.cs.toronto.edu/~kriz/' + file_name
dest_dir = '/home/jupyter/mnt/datasets/<dataset_name>/'
dest_file = dest_dir + file_name
with urlopen(file_url) as i:
    with open(dest_file, 'w+b') as o:
        o.write(i.read())
tar = tarfile.open(dest_file)
tar.extractall(dest_dir)
tar.close()
remove(dest_file)
# This prevents serialization of temporary variables
del i, o, tar
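Note that i.read() buffers the entire archive in memory. For larger files, you may prefer streaming the download to disk in chunks; a sketch with shutil.copyfileobj:
import shutil
from urllib.request import urlopen

file_url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
dest_file = '/home/jupyter/mnt/datasets/<dataset_name>/cifar-10-python.tar.gz'

# Stream the download in chunks instead of holding it all in memory.
with urlopen(file_url) as src, open(dest_file, 'wb') as dst:
    shutil.copyfileobj(src, dst)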
Populating a dataset with Object Storage objects
If your bucket is attached to the project via an S3 connector, you can create a dataset from its objects as if they were local files in your project. Alternatively, download the objects directly with the boto3 library, using a static access key stored in the project secrets:
#pragma dataset init <dataset_name> --size 1Gb
import os
import boto3
S3_CREDS = {
    "aws_access_key_id": os.environ['<secret_with_access_key_ID>'],
    "aws_secret_access_key": os.environ['<secret_with_secret_key>']
}
bucket_name = "<bucket_name>"
source_path = ''
target_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
s3r = boto3.resource(service_name='s3', endpoint_url='https://storage.yandexcloud.net', **S3_CREDS)
bucket = s3r.Bucket(bucket_name)
for obj in bucket.objects.filter(Prefix=source_path):
    if not os.path.relpath(obj.key, source_path).startswith('../'):
        os.makedirs(os.path.join(target_path, os.path.dirname(obj.key)), exist_ok=True)
        if obj.key[-1] != '/':
            bucket.download_file(obj.key, os.path.join(target_path, obj.key))
Where:
- aws_access_key_id: ID of the static access key generated for the project service account.
- aws_secret_access_key: Secret key generated for that service account.
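If the bucket holds many small objects, the sequential loop above can be slow. Below is a sketch of parallelizing the downloads with a thread pool; boto3 clients, unlike resources, are generally thread-safe, and S3_CREDS, bucket_name, source_path, target_path, and bucket are assumed to be defined as in the example above:
import os
from concurrent.futures import ThreadPoolExecutor
import boto3

s3c = boto3.client(service_name='s3',
                   endpoint_url='https://storage.yandexcloud.net', **S3_CREDS)

def fetch(key):
    dst = os.path.join(target_path, key)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    s3c.download_file(bucket_name, key, dst)

# Skip "folder" placeholder keys, then download up to 8 objects at a time.
keys = [obj.key for obj in bucket.objects.filter(Prefix=source_path)
        if not obj.key.endswith('/')]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fetch, keys))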
Populating a dataset with files from Yandex Disk
Initialize the dataset from a cell with the following code:
#pragma dataset init <dataset_name> --size 8Gb
import requests
from urllib.parse import urlencode
from io import BytesIO
from zipfile import ZipFile
base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?'
public_key = '<link_to_folder_on_Yandex_Disk>'
final_url = base_url + urlencode(dict(public_key=public_key))
response = requests.get(final_url)
download_url = response.json()['href']
response = requests.get(download_url)
dist_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
zipfile = ZipFile(BytesIO(response.content))
zipfile.extractall(path=dist_path)
Where <link_to_folder_on_Yandex_Disk> is the link to the Yandex Disk folder whose contents you need to import to DataSphere.
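Since response.content holds the entire archive in memory, very large folders may exhaust RAM. A sketch that streams the download to a temporary file before unpacking (download_url and dist_path are obtained as in the example above):
import tempfile
import requests
from zipfile import ZipFile

# Stream the archive to a seekable temporary file, then extract it.
with requests.get(download_url, stream=True) as r, tempfile.TemporaryFile() as tmp:
    for chunk in r.iter_content(chunk_size=1 << 20):
        tmp.write(chunk)
    ZipFile(tmp).extractall(path=dist_path)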
Populating a dataset with files from Google Drive
- Install the gdown package:
%pip install gdown
- Initialize the dataset from a cell with the following code:
#pragma dataset init <dataset_name> --size 1Gb
import gdown

gdrive_folder_id = '<Google_Drive_folder_ID>'
dst_path = '/home/jupyter/mnt/datasets/<dataset_name>/'
gdown.download_folder(id=gdrive_folder_id, output=dst_path, use_cookies=False)
Where <Google_Drive_folder_ID> is the ID of the Google Drive folder specified in the URL after https://drive.google.com/drive/folders/. For example, in the https://drive.google.com/drive/folders/exampleId URL, the folder ID is exampleId.
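If you need a single file rather than a whole folder, recent gdown versions also accept a file ID; a sketch (the output file name is illustrative):
#pragma dataset init <dataset_name> --size 1Gb
import gdown

# <Google_Drive_file_ID> comes from the file's sharing link.
gdown.download(id='<Google_Drive_file_ID>',
               output='/home/jupyter/mnt/datasets/<dataset_name>/data.zip')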
Enabling a dataset
Once initialized, the dataset becomes enabled in the project and available at /home/jupyter/mnt/datasets/<dataset_name>
. You can have up to three datasets enabled within a project at the same time. During your work, you can enable datasets whenever needed and disable them when the data is no longer in use.
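An enabled dataset behaves like a regular read-only directory, so its files can be read directly; for example (the file name is illustrative):
import pandas as pd

df = pd.read_csv('/home/jupyter/mnt/datasets/<dataset_name>/data.csv')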
To enable a dataset:
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- Under Project resources, click Dataset.
- In the line with the dataset in question, click ⋮ and select Activate.
Viewing a list of datasets available in the project
On the project page under Project resources, select Dataset.
Sharing a dataset
Note
You can only share resources within a single organization between communities created in the same availability zone.
To share a dataset:
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- Under Project resources, click Dataset.
- Select the dataset from the list.
- Go to the Access tab.
- Enable the visibility option next to the name of the community you want to share the dataset with.
To make a dataset available for use in another project, you need to add it on the Shared tab.
Deactivating a dataset
To deactivate a dataset:
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- On the project page under Project resources, click Dataset.
- In the line with the dataset in question, click ⋮ and select Deactivate.
Deleting a dataset
You can delete a dataset only if it is:
- Not enabled in the current project.
- Not used in other projects.
- Unavailable to the community.
To delete a dataset:
- On the project page under Project resources, click Dataset.
- In the line with the dataset in question, click ⋮ and select Delete.
Warning
The actual deletion of resources can take up to 72 hours.