Save a dataset

Save a dataset#

Your data can be stored in various places; on your local machine’s disk, or as in in-memory data structures like Arrow tables, Python dictionaries and Pandas DataFrames. This guide will show you how to do this.

Persisted file formats#

Datasets (Dataset and DatasetDict) can be stored as local files on your computer, or in the cloud. The datasets could be stored as a parquet, csv, or json file. The molflux.datasets.save_dataset_to_store() function can save your datasets as each of these file types.

Hint

This will work automatically for both local and cloud data. If you need more fine-grained control over the filesystem, you can pass your own fsspec-compatible filesystem object to load_dataset_from_store() as an argument to the fs parameter.

For convenience, we also make available a custom AWS S3 Filesystem fsspec implementation which you can create with fsspec.filesystem("s3").

Parquet#

Parquet files are stored in a columnar format, unlike row-based files like a CSV. Large datasets may be stored in a Parquet file because it is more efficient and faster at returning your query.

To save a dataset to Parquet:

from molflux.datasets import save_dataset_to_store

save_dataset_to_store(dataset, path="s3://my-bucket/my_file.parquet")

You can also save DatasetDicts. In this case, the target path should point at a directory where the individual splits will be saved.

from molflux.datasets import save_dataset_to_store

save_dataset_to_store(dataset_dict, path="s3://my-bucket/data")

For other persistence formats, a format will need to be specified to tell molflux.datasets which file format the DataseDict should be saved as.

CSV#

You can store your dataset as CSV:

from molflux.datasets import save_dataset_to_store

# save a Dataset
save_dataset_to_store(dataset, path="my_file.csv")

# save a DatasetDict
save_dataset_to_store(dataset_dict, path="my/data", format="csv")

JSON#

You can store your dataset as JSON as shown below:

from molflux.datasets import save_dataset_to_store

# save a Dataset
save_dataset_to_store(dataset, path="my_file.json")

# save a DatasetDict
save_dataset_to_store(dataset_dict, path="my/data", format="json")

Disk#

You can store your dataset on disk as a collection of .arrow file(s) and the dataset’s json metadata:

from molflux.datasets import save_dataset_to_store

# save a Dataset
save_dataset_to_store(dataset, path="my/dataset/dir")

# save a DatasetDict
save_dataset_to_store(dataset_dict, path="my/data", format="disk")