More data loading options#
Your datasets can be stored in various places: in the molflux.datasets catalogue, on your local machine's disk, on a remote disk, on the HuggingFace datasets hub, or in in-memory data structures such as Arrow tables, Python dictionaries, and Pandas DataFrames. In all of these cases, molflux.datasets can load them.
HuggingFace Hub#
Datasets are typically loaded from a dataset loading script that downloads and generates the dataset. However, you can also load a dataset from any dataset repository on the HuggingFace Hub without a loading script! You just need to use the molflux.datasets.load_dataset() function.
For example, try loading the files from this demo repository by providing the repository namespace and dataset name. The repository contains CSV files, which the code below loads:
from molflux.datasets import load_dataset
dataset = load_dataset("lhoestq/demo1")
Some datasets may have more than one version based on Git tags, branches, or commits. Use the revision parameter to specify the dataset version you want to load:
from molflux.datasets import load_dataset
dataset = load_dataset(
    "lhoestq/custom_squad",
    revision="main",  # tag name, branch name, or commit hash
)
The MolFlux catalogue#
Similarly, to load a dataset from the molflux
catalogue:
from molflux.datasets import load_dataset
dataset = load_dataset("esol")
Remember that you can list the datasets available in the catalogue with:
from molflux.datasets import list_datasets
catalogue = list_datasets()
print(catalogue)
{'core': ['ani1x', 'ani2x', 'esol', 'gdb9', 'pcqm4m_v2', 'spice'], 'tdc': ['tdc_admet_benchmarks']}
Persisted file formats#
Datasets can also be loaded from local files stored on your computer or in the cloud. The datasets could be stored as parquet, csv, json, or txt files. The molflux.datasets.load_dataset_from_store() function can load each of these file types.
Hint
This will work automatically for both local and cloud data. If you need more fine-grained control over the filesystem, you can pass your own fsspec-compatible filesystem object to load_dataset_from_store() as an argument to the fs parameter. For convenience, we have also made available a custom AWS S3 filesystem fsspec implementation, which you can create with fsspec.filesystem("s3").
CSV#
You can read a dataset made up of one or several CSV files:
from molflux.datasets import load_dataset_from_store
dataset = load_dataset_from_store("my_file.csv")
If you are working with partitioned files, you can also load several CSV files at once:
data_files = ["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"]
dataset_dict = load_dataset_from_store(data_files)
You can also map the training and test splits to specific CSV files, and load them as a DatasetDict or as a single Dataset:
data_files = {"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"}
# load as a DatasetDict
dataset_dict = load_dataset_from_store(data_files)
# load just the 'train' split
dataset = load_dataset_from_store(data_files, split="train")
# merge all splits into a single Dataset
dataset = load_dataset_from_store(data_files, split="all")
To load remote CSV files via HTTP, pass the URLs instead:
base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
data_files = {'train': base_url + 'train.csv', 'test': base_url + 'test.csv'}
dataset_dict = load_dataset_from_store(data_files)
To load zipped CSV files, you might need to explicitly provide the persistence format:
data_files = "data.zip"
dataset = load_dataset_from_store(data_files, format="csv")
Parquet#
Parquet files are stored in a columnar format, unlike row-based formats such as CSV. Large datasets may be stored in Parquet files because the format is more efficient and faster at returning queries.
You can load Parquet files in the same way as the CSV examples shown above. For example, to load a single Parquet file:
from molflux.datasets import load_dataset_from_store
dataset = load_dataset_from_store("my_file.parquet")
You can also map the training and test splits to specific Parquet files:
from molflux.datasets import load_dataset_from_store
data_files = {'train': 'train.parquet', 'test': 'test.parquet'}
dataset_dict = load_dataset_from_store(data_files)
To load remote Parquet files via HTTP, pass the URLs instead:
base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
data_files = {"train": base_url + "wikipedia-train.parquet"}
wiki = load_dataset_from_store(data_files, split="train")
JSON#
JSON files are loaded directly as shown below:
from molflux.datasets import load_dataset_from_store
dataset = load_dataset_from_store("my_file.json")
JSON files come in diverse formats, but we think the most efficient format is JSON Lines, where each line is a JSON object representing an individual row of data. For example:
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
Another JSON format you may encounter nests the data under a field, in which case you'll need to specify the field argument as shown below:
{"version": "0.1.0",
"data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
{"a": 4, "b": -5.5, "c": null, "d": true}]
}
dataset = load_dataset_from_store("my_file.json", field="data")
To load remote JSON files via HTTP, pass the URLs instead:
base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
data_files = {"train": base_url + "train-v1.1.json", "validation": base_url + "dev-v1.1.json"}
dataset = load_dataset_from_store(data_files, field="data")
Disk#
You can read datasets that you have previously saved with molflux.datasets.save_dataset_to_store(..., format="disk"). You just need to provide the path to the directory holding your .arrow file(s) and the dataset's json metadata:
from molflux.datasets import load_dataset_from_store
dataset = load_dataset_from_store("my/dataset/dir")
In-memory data#
To create datasets directly from in-memory data structures such as Arrow tables, Python dictionaries, and Pandas DataFrames, you can use HuggingFace's datasets.Dataset and datasets.DatasetDict class methods directly.
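For example, a minimal sketch using HuggingFace's Dataset.from_dict() and Dataset.from_pandas() class methods (the column names and values below are just toy data):
import pandas as pd
from datasets import Dataset

# build a dataset from a Python dictionary
dataset = Dataset.from_dict({"smiles": ["C", "CC"], "value": [0.1, 0.2]})

# build a dataset from a Pandas DataFrame
df = pd.DataFrame({"smiles": ["C", "CC"], "value": [0.1, 0.2]})
dataset = Dataset.from_pandas(df)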