SaganMC: A Molecular Complexity Dataset with Mass Spectra
Summary
SaganMC is a machine learning-ready dataset designed for molecular complexity prediction, spectral analysis, and chemical discovery. Molecular complexity metrics quantify how structurally intricate a molecule is, reflecting how difficult it is to construct or synthesize.
The dataset includes 406,446 molecules, of which a subset of 16,653 has experimental mass spectra. We provide standard representations (SMILES, InChI, SELFIES), RDKit-derived molecular descriptors, Morgan fingerprints, and three complementary complexity scores: Bertz, Böttcher, and the Molecular Assembly Index (MA). MA scores, computed using code from the Cronin Group, are especially relevant to astrobiology research as potential agnostic biosignatures. Assigning MA indices to molecules is computationally intensive: generating this dataset required over 100,000 CPU hours on Google Cloud.
SaganMC is named in honor of Carl Sagan, the astronomer and science communicator whose work inspired generations to explore life beyond Earth. The initial version of this dataset was produced during a NASA Frontier Development Lab (FDL) astrobiology sprint.
Intended Uses
* Train machine learning models to predict molecular complexity directly from molecular structure or mass spectrometry data.
* Develop surrogate models to approximate Molecular Assembly Index (MA) scores efficiently at large scale.
* Benchmark complexity metrics (Bertz, Böttcher, MA) across diverse molecular classes.
* Enable onboard ML pipelines for spacecraft to prioritize high-complexity chemical targets during exploration.
* Explore correlations between molecular complexity and experimental observables such as mass spectra.
* Support AI-driven chemical discovery tasks.
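As one way to approach the benchmarking use above, the agreement between two complexity metrics can be summarized with a Spearman rank correlation. The following is a minimal, standard-library sketch and not part of the dataset's own tooling; the function names are illustrative, and the score lists would have to be read from whichever columns hold the Bertz, Böttcher, or MA values.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A rank-based measure is used here because the three metrics live on different scales, so only their ordering of molecules is directly comparable.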
Available Formats
CSV
The original dataset is in CSV format.
* SaganMC-400k (sagan-mc-400k.csv): The full dataset with 406,446 molecules, including structural and complexity features.
* SaganMC-Spectra-16k (sagan-mc-spectra-16k.csv): A 16,653-molecule subset of the full dataset, with experimental mass spectra from NIST.
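Because the full file holds several hundred thousand rows, it can be convenient to stream it rather than load it all at once. The sketch below uses only Python's standard `csv` module. The column names (`smiles`, `ma`) are assumptions made for illustration and should be checked against the actual CSV header; the in-memory sample rows are made-up placeholders, not values from the dataset.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_molecules(handle: TextIO, score_column: str, min_score: float) -> Iterator[dict]:
    """Yield CSV rows whose score_column parses as a float >= min_score."""
    reader = csv.DictReader(handle)
    for row in reader:
        raw = row.get(score_column, "")
        try:
            score = float(raw)
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if score >= min_score:
            yield row


# Placeholder data standing in for the real file. With the actual dataset,
# the handle would come from something like:
#   open("sagan-mc-400k.csv", newline="", encoding="utf-8")
# Column names "smiles" and "ma" are hypothetical.
sample = io.StringIO("smiles,ma\nCCO,3\nc1ccccc1,\nCC(=O)O,5\n")
selected = [row["smiles"] for row in iter_molecules(sample, "ma", 4.0)]
```

Passing an open file handle instead of a path keeps the function easy to test and lets the same code read either of the two CSV files.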
(Description continued in ReadMe.txt file)
Tags: molecule, drug-discovery, chemistry, molecular-complexity, mass-spectrometry, astrobiology, biology