Datasets in a nutshell
The cornerstone of machine learning is data. Datasets for drug discovery in particular are numerous and rapidly evolving. New datasets are constantly being created, but accessing them from different sources quickly becomes tedious, inconvenient, and prone to incompatibilities.
The `datasets` submodule aims to address these issues. It is a library of many different datasets from multiple sources. Whether you are looking for public datasets (such as ESOL or QM9) or just easy access to saved data, `datasets` provides a standard and modular interface for accessing and manipulating these datasets!
It is built on top of the [HuggingFace](https://huggingface.co/docs/datasets/index) `datasets` package, which is versatile, fast, and efficient. It can handle many data types, and its built-in functionality allows for scalable and fast data manipulation.