Datasets in DataSphere
A dataset in DataSphere is a storage resource that provides quick access to large amounts of data. A dataset can store up to 4 TB of data and gives faster access to it than the main project storage does.
Tip
The larger the disk allocated for a dataset, the higher the data read speed.
A dataset is created and populated during initialization. Once initialized, a dataset becomes read-only and cannot be changed. To add files to a dataset, create it again.
Datasets are not included in the main project storage and are priced separately.
To use your data in multiple projects, you can share your datasets within your community, just like any other resource.
When a dataset is activated in a project, the disk containing it is mounted to the project storage. You can read the files of an activated dataset as local files in your project storage at the following path: /home/jupyter/mnt/datasets/<dataset_name>.
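Because an activated dataset appears as an ordinary directory, standard file APIs are enough to read it. The following is a minimal sketch, not an official DataSphere API: the helper name list_dataset_files and the dataset name my-dataset are hypothetical, and only the mount path comes from this page.

```python
from pathlib import Path

# Mount root for activated datasets, as given in the path above.
DATASETS_ROOT = Path("/home/jupyter/mnt/datasets")


def list_dataset_files(dataset_name: str, root: Path = DATASETS_ROOT) -> list[Path]:
    """Return all regular files of a dataset directory, sorted by path.

    Hypothetical helper for illustration; `root` is a parameter so the
    function can also be tried outside DataSphere.
    """
    dataset_dir = Path(root) / dataset_name
    if not dataset_dir.is_dir():
        # Most likely the dataset is not activated in this project.
        raise FileNotFoundError(f"Dataset directory not found: {dataset_dir}")
    return sorted(p for p in dataset_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    # "my-dataset" is a placeholder; replace it with your dataset name.
    try:
        for path in list_dataset_files("my-dataset"):
            print(path)
    except FileNotFoundError as error:
        print(error)
```

Since datasets are read-only, open the files for reading only and write any results to the project storage instead.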
You can have up to three datasets activated in a project at the same time. You can activate and deactivate datasets while a project is running, without having to restart it. For a complete list of DataSphere restrictions, see Quotas and limits in DataSphere.
Information about a dataset as a resource
For each dataset, the following information is stored:
- Name.
- Status of its connection to the project.
- Name of the user who created the dataset.
- Dataset creation date in UTC format, such as July 18, 2022, 14:23.
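If you copy a creation date from the interface and need to compare or sort it in code, you can parse it into a timezone-aware value. This is a sketch that assumes the date always follows the pattern of the example above (full English month name, 24-hour time) and that the process runs with an English or C locale; the function name parse_creation_date is hypothetical.

```python
from datetime import datetime, timezone

# Pattern inferred from the example "July 18, 2022, 14:23" (an assumption).
CREATION_DATE_FORMAT = "%B %d, %Y, %H:%M"


def parse_creation_date(text: str) -> datetime:
    """Parse a dataset creation date string and mark it as UTC."""
    parsed = datetime.strptime(text.strip(), CREATION_DATE_FORMAT)
    # The displayed date is in UTC, so attach the UTC time zone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    created = parse_creation_date("July 18, 2022, 14:23")
    print(created.isoformat())  # 2022-07-18T14:23:00+00:00
```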
To view dataset details, click its name in the project's dataset list. On the dataset's Overview tab, you can see:
- Availability zone the dataset is stored in.
- Size.
- Initialization code.