Datasets in a nutshell

The cornerstone of machine learning is data. Datasets for drug discovery in particular are numerous and rapidly evolving. New datasets are constantly being created, but accessing them from different sources quickly becomes tedious and prone to incompatibilities.

The datasets submodule aims to address these issues. It is a library of datasets drawn from multiple sources. Whether you are looking for public datasets (such as ESOL or QM9) or just easy access to saved data, the datasets submodule provides a standard, modular interface for accessing and manipulating them.

The submodule is built on top of the [HuggingFace](https://huggingface.co/docs/datasets/index) datasets package, which is versatile, fast, and efficient. It can handle many data types, and its built-in functionality allows for scalable and fast data manipulation.