This summer, SandboxAQ released the Structurally Augmented IC50 Repository (SAIR), the largest dataset of cofolded 3D protein-ligand structures paired with experimentally measured IC₅₀ labels. By directly linking molecular structure to drug potency, it addresses a long-standing shortage of training data. The dataset is now available on Hugging Face, giving researchers, for the first time, open access to more than 5 million AI-generated, highly accurate protein-ligand 3D structures, each paired with validated empirical binding potency data.
SAIR is an open dataset, freely available under a permissive CC BY 4.0 license and ready for use in commercial and non-commercial R&D pipelines. More than just a dataset, SAIR is a strategic asset that fills a long-standing data gap in AI-enabled drug design. It enables pharma, biotech, and techbio leaders to accelerate R&D, expand target coverage, and power AI models, moving much of the costly and time-consuming work of drug design and optimization from the wet lab to in silico. That means faster hit-to-lead timelines, more efficient lead optimization, fewer dead-end projects, and a more predictable path from initial idea to clinical candidate.
AI and computer-aided design have the potential to dramatically accelerate the development of new drugs. For decades, scientists have dreamed of an AI that could identify or design potent, nontoxic, and effective compounds from prompts describing disease pathways, effectively compressing years of drug research and development into minutes on a computer. However, this vision is bottlenecked by AI’s limited ability to predict important drug properties, such as efficacy and toxicity, from molecular structure alone.
Additionally, traditional structure-based discovery often stalls early on the need to determine reliable 3D structures. A molecule’s three-dimensional structure determines its function, dynamics, and interactions, which is particularly important when potential drug candidates are expected to bind to human protein targets.
Experimental methods such as X-ray crystallography and cryo-EM require significant time and investment, and many promising disease targets still lack experimentally verified structural information. Computer simulations have helped lower the barriers to obtaining 3D structures and predicting binding affinities. However, early-generation algorithms for protein folding and docking (such as AlphaFold and Vina, respectively) predict only static snapshots of molecules and proteins, which in reality are inherently dynamic and shape-changing.
SAIR addresses that constraint by compiling over 1 million unique, computationally cofolded protein-ligand pairs, yielding 5.24 million distinct 3D complexes (five distinct cofolded structures for each pair). Each structure is paired with curated IC₅₀ measurements from ChEMBL or BindingDB, providing for the first time a scalable link between high-quality 3D structures and drug potency and closing historical data gaps that have hindered AI-driven discovery. Deep-learned affinity models trained on similar data, such as Boltz-2, have been shown to deliver up to 1,000x speedups compared with traditional ab initio approaches.
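IC₅₀ values span many orders of magnitude, so affinity models are commonly trained on a log-scaled target such as pIC₅₀ = -log₁₀(IC₅₀ in mol/L). The following is a minimal sketch of that conversion. It assumes the measurement is given in nanomolar, a common convention that this post does not state, and the function name is illustrative rather than part of SAIR.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L).

    Assumption: the input unit is nM; check the dataset documentation
    for the actual unit before using this on real data.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so pIC50 = 9 - log10(IC50 in nM).
    return 9.0 - math.log10(ic50_nm)


# Illustrative values only, not taken from SAIR.
for ic50 in (1.0, 100.0, 10_000.0):
    print(f"{ic50:>8.1f} nM -> pIC50 {ic50_nm_to_pic50(ic50):.2f}")
```

A higher pIC₅₀ means a more potent compound: each unit corresponds to a tenfold lower IC₅₀.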
Creating SAIR was a major feat of high-performance AI computing. Using NVIDIA DGX Cloud on Google Cloud Platform, it took over 130,000 GPU hours to compute the SAIR dataset with Boltz-1, a cofolding AI model, on a cluster of 760 NVIDIA H100 GPUs.
Access to highly detailed node, operator, scheduler, and GPU metrics, together with close collaboration on both infrastructure and workload optimization, enabled the NVIDIA AI Accelerator and SandboxAQ engineering teams to identify bottlenecks, tune configurations, and maximize workload throughput.
As a result, the two teams achieved over 95% GPU compute utilization while generating the SAIR dataset. This let us create SAIR in three weeks rather than the originally estimated three months (a more than 4x speedup), and produced a highly optimized, GPU-native computational workflow that integrates seamlessly with today’s enterprise computing environments.
Generating such large amounts of data is only half the battle; confidence in its quality is equally important. All predicted complexes therefore underwent rigorous validation using PoseBusters, an industry-standard open-source tool for benchmarking structure-related AI in drug discovery that checks chemical soundness and physical plausibility.
In the end, 97% of SAIR’s structures passed all checks. In addition to the PoseBusters validation, we benchmarked key affinity prediction methods, including empirical scoring functions, 3D CNNs, and graph neural networks, on SAIR’s synthetic structures and experimental IC₅₀ values. Detailed results from these studies are available in the manuscript on bioRxiv.
This makes SAIR a reliable foundation for downstream modeling, screening, and design, as well as for benchmarking new models.
A persistent challenge in drug discovery is the “dark proteome”: disease-associated proteins for which no experimental structure exists. SAIR illuminates these unknown areas by providing reliable AI-predicted complexes even where experimental data is lacking. For example, more than 40 percent of the proteins in the SAIR dataset have no structure at all in the Protein Data Bank (PDB), with or without a ligand. SAIR thereby addresses one of the biggest weaknesses of existing AI models: poor generalization caused by a lack of data. Using SAIR, scientists can now explore targets previously thought to be undruggable, with structural hypotheses from reliable model predictions to guide virtual screening and optimization.
Additionally, the breadth of targets covered by SAIR reveals polypharmacological patterns, showing how a single molecule interacts with multiple proteins. This rich set of interactions can be used to train AI models to predict off-target effects and identify new repurposing opportunities, giving organizations a deeper understanding of compound profiles before work begins in the lab.
Access to SAIR
SAIR is freely available on Hugging Face. Here’s a quick guide to getting SAIR from the Hub, browsing the main table, and (optionally) downloading some of the structure archives.
1. Install the essentials
Use huggingface_hub to retrieve files and pandas with pyarrow to read Parquet.
pip install huggingface_hub pandas pyarrow
2. Authenticate
Authenticate with Hugging Face:
import huggingface_hub
huggingface_hub.login(token="your_auth_token")
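Hardcoding a token in a script makes it easy to leak through version control. One hedged alternative, sketched below, is to read it from an environment variable. `HF_TOKEN` is the variable name that huggingface_hub itself recognizes; the helper `read_hf_token` is a hypothetical name introduced here for illustration.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    `read_hf_token` is an illustrative helper, not part of huggingface_hub.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your access token"
        )
    return token
```

The result can then be passed to the login call shown above, for example `huggingface_hub.login(token=read_hf_token())`.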
3. Load the main table (sair.parquet).
This will retrieve the file from the hub and load it into a DataFrame.
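from huggingface_hub import hf_hub_download
import pandas as pd

# Download sair.parquet from the dataset repo (cached locally after the first call).
parquet_path = hf_hub_download(
    repo_id="SandboxAQ/SAIR",
    filename="sair.parquet",
    repo_type="dataset",
)

# Read the table and preview the first rows.
df = pd.read_parquet(parquet_path)
df.head()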
4. (Optional) List available structure archives.
Structure files are shipped as a number of .tar.gz archives under structures_compressed/. List them and choose the ones you need.
from huggingface_hub import list_repo_files

# Keep only the archive file names under structures_compressed/.
files = [
    f.split("/")[-1]
    for f in list_repo_files("SandboxAQ/SAIR", repo_type="dataset")
    if f.startswith("structures_compressed/") and f.endswith(".tar.gz")
]
files[:5]
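When only a few entries are of interest, it helps to know which archive to download. The sketch below parses the two numbers embedded in each archive name and looks up the archive covering a given ID. It rests on two assumptions that this post does not confirm: that archive names follow the pattern `sair_structures_<start>_to_<end>.tar.gz`, and that the numbers delimit an inclusive range of entry IDs. The helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed naming pattern: sair_structures_<start>_to_<end>.tar.gz
ARCHIVE_RE = re.compile(r"^sair_structures_(\d+)_to_(\d+)\.tar\.gz$")


def parse_range(name: str) -> Optional[tuple[int, int]]:
    """Return (start, end) parsed from an archive name, or None if it does not match."""
    match = ARCHIVE_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def archive_for(entry_id: int, names: list[str]) -> Optional[str]:
    """Return the first archive whose assumed inclusive range contains entry_id."""
    for name in names:
        bounds = parse_range(name)
        if bounds is not None and bounds[0] <= entry_id <= bounds[1]:
            return name
    return None


# Example names in the assumed pattern.
names = [
    "sair_structures_1006049_to_1016517.tar.gz",
    "sair_structures_100623_to_111511.tar.gz",
]
print(archive_for(1010000, names))
```

If the README describes a different mapping between entries and archives, that mapping should take precedence over this guess.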
5. (Optional) Download and extract structures
Each archive can be large (approximately 10 GB). Download only what you need and extract it locally.
import os
import tarfile
from huggingface_hub import hf_hub_download

dest = "sair_structs"
os.makedirs(dest, exist_ok=True)

# Archives to fetch; pick names from the listing in step 4.
to_get = [
    "sair_structures_1006049_to_1016517.tar.gz",
    "sair_structures_100623_to_111511.tar.gz",
]

for name in to_get:
    # Download one archive into dest.
    tar_path = hf_hub_download(
        repo_id="SandboxAQ/SAIR",
        filename=f"structures_compressed/{name}",
        repo_type="dataset",
        local_dir=dest,
        local_dir_use_symlinks=False,
    )
    # Extract it, then delete the archive to save disk space.
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(dest)
    os.remove(tar_path)
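Extracting every member of a downloaded archive without inspection trusts its contents completely. As a defensive variant of the extraction step, the sketch below refuses any member that would be written outside the destination directory. `safe_extract` is a hypothetical helper, not part of the SAIR README, and rejecting all links is a deliberately conservative choice that may need relaxing if an archive legitimately contains them.

```python
import os
import tarfile


def safe_extract(tar_path: str, dest: str) -> list[str]:
    """Extract a .tar.gz into dest, refusing members that would land outside it."""
    dest_root = os.path.realpath(dest)
    with tarfile.open(tar_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Reject absolute paths and ".." components that escape dest.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
            # Links can point outside dest even when their own path looks fine.
            if member.issym() or member.islnk():
                raise ValueError(f"link not allowed in archive: {member.name}")
        # Use the stdlib "data" extraction filter where this Python version has it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_root, members=members, **extra)
    return [member.name for member in members]
```

In the loop above, `safe_extract(tar_path, dest)` could stand in for the `tar.extractall(dest)` call.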
A complete version of this script, with more robust logging and validation, is available in the README file. To learn more, visit the SAIR homepage, read the manuscript on bioRxiv, or watch our 25-minute joint webinar with NVIDIA, where we demonstrate SAIR and explain how its data is structured. Extensive documentation, tutorials, and sample benchmarks are provided to ease use and speed internal adoption.
The future of drug discovery is data-driven, AI-accelerated, and based on scalable, high-quality structural insights. Although no AI can yet design effective drug therapies from prompts alone, SAIR brings researchers closer to that goal with new data and insights that could shave years off even AI-accelerated R&D pipelines.
We can’t wait to see what researchers build with SAIR. SandboxAQ experts are available to support researchers throughout the discovery process.
Have a question?
Please contact the author or post on the SAIR dataset discussion page.
Authors: Arman Zaribafyan, Georgia Channing, Zane Beckwith, Rudy Plesch

