Versa AI hub

Accelerate drug research and development with AI-powered structural intelligence

By versatileai, January 25, 2026

This summer, SandboxAQ released the Structurally Augmented IC50 Repository (SAIR), the largest dataset of cofolded 3D protein-ligand structures paired with experimentally measured IC₅₀ labels. By directly linking molecular structure to drug potency, it addresses a long-standing shortage of training data. The dataset is now available on Hugging Face, giving researchers open access, for the first time, to more than 5 million highly accurate, AI-generated protein-ligand 3D structures, each paired with validated experimental binding potency data.

SAIR is an open dataset, freely available under the permissive CC BY 4.0 license and ready for use in commercial and non-commercial R&D pipelines. More than just a dataset, SAIR is a strategic asset that fills a long-standing data gap in AI-enabled drug design. It enables pharma, biotech, and techbio leaders to accelerate R&D, expand coverage, and power AI models, moving much of the costly, time-consuming work of drug design and optimization from the wet lab to in silico. That means faster hit-to-lead timelines, more efficient lead optimization, fewer dead-end projects, and a more predictable path from initial idea to clinical candidate.

AI and computer-aided design have the potential to dramatically accelerate the development of new drugs. For decades, scientists have dreamed of an AI that could identify or design potent, nontoxic, and effective compounds from a prompt describing a disease pathway, effectively compressing years of drug research and development into minutes on a computer. However, this vision is bottlenecked by how well AI can predict important drug properties, such as efficacy and toxicity, from molecular structure alone.

In addition, traditional structure-based discovery often stalls early on the need to determine reliable 3D structures. A molecule’s three-dimensional structure determines its function, dynamics, and interactions, which matters most when a drug candidate is expected to bind a human protein target.

Experimental methods such as X-ray crystallography and cryo-EM require significant time and investment, and many promising disease targets still lack experimentally verified structures. Computer simulations have lowered the barriers to obtaining 3D structures and predicting binding affinities. However, early-generation algorithms for protein folding and docking (such as AlphaFold and Vina, respectively) predict only static snapshots of proteins and ligands, which in reality are inherently dynamic and change shape.

SAIR addresses that constraint by compiling over 1 million unique, computationally cofolded protein-ligand pairs, yielding 5.24 million distinct 3D complexes (5 distinct cofolded structures per pair). Each structure is paired with a curated IC₅₀ measurement from ChEMBL or BindingDB, providing for the first time a scalable link between high-quality 3D structures and drug potency and closing a historical data gap that has hindered AI-driven discovery. Deep-learning affinity models trained on similar data, such as Boltz-2, have been shown to deliver up to 1,000x speedups over traditional ab initio approaches.
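Affinity models are commonly trained on a log-scaled potency rather than on raw IC₅₀ values, because measured potencies span many orders of magnitude. The following is a minimal sketch of that conversion, assuming the IC₅₀ is expressed in nanomolar; the units used in SAIR are not stated in this article, so check the dataset documentation before applying it. The function name and the example values are hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


# Hypothetical potencies in nM, for illustration only (not taken from SAIR).
example_ic50_nm = [1.0, 50.0, 1000.0]
for value in example_ic50_nm:
    print(f"IC50 = {value:g} nM -> pIC50 = {ic50_nm_to_pic50(value):.2f}")
```

A higher pIC50 means a more potent compound: each unit corresponds to a tenfold lower IC₅₀.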

Creating SAIR was a major undertaking in high-performance AI computing. Using NVIDIA DGX Cloud via Google Cloud Platform, the dataset took over 130,000 GPU hours to compute with Boltz-1, a cofolding AI model, on a cluster of 760 NVIDIA H100 GPUs.

Access to highly detailed node, operator, scheduler, and GPU metrics, together with close collaboration on both infrastructure and workload optimization, enabled the NVIDIA AI Accelerator and SandboxAQ engineering teams to identify bottlenecks, tune configurations, and maximize workload throughput.

As a result, the two teams achieved over 95% GPU compute utilization while generating the SAIR dataset. This let us create SAIR in three weeks rather than the originally estimated three months (a speedup of more than 4x), and it produced a highly optimized, GPU-native computational workflow that integrates with today’s enterprise computing environments.
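The figures quoted above can be turned into a rough per-structure cost. This is a back-of-the-envelope sketch, assuming 30-day months and taking "over 130,000 GPU hours" as a lower bound; the variable names are illustrative.

```python
# Figures quoted in the article.
GPU_HOURS = 130_000        # "over 130,000 GPU hours" (lower bound)
N_COMPLEXES = 5_240_000    # 5.24 million 3D complexes

# Assumption: a month is counted as 30 days.
ESTIMATED_DAYS = 3 * 30    # original estimate: three months
ACTUAL_DAYS = 3 * 7        # actual duration: three weeks

gpu_seconds_per_complex = GPU_HOURS * 3600 / N_COMPLEXES
speedup = ESTIMATED_DAYS / ACTUAL_DAYS

print(f"{gpu_seconds_per_complex:.0f} GPU-seconds per complex")
print(f"{speedup:.1f}x faster than the original estimate")
```

Under these assumptions, each complex cost roughly 89 GPU-seconds, and the schedule shrank by a factor of about 4.3, consistent with the "more than 4x" figure.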

Generating data at this scale is only half the battle; confidence in its quality is equally important. All predicted complexes therefore underwent rigorous validation with PoseBusters, an industry-standard open-source tool for benchmarking structure-prediction AI in drug discovery that checks chemical soundness and physical plausibility.
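To give a feel for what a physical plausibility check looks like, here is a deliberately simplified sketch of one such test: counting protein-ligand atom pairs that sit implausibly close together. This is not PoseBusters’ implementation, whose tests are more elaborate; the 2.0 Å threshold, the function name, and the toy coordinates are all assumptions made for illustration.

```python
import math
from typing import Sequence

Coord = Sequence[float]


def count_clashes(
    protein: Sequence[Coord],
    ligand: Sequence[Coord],
    min_dist: float = 2.0,  # assumed cutoff in angstroms, illustrative only
) -> int:
    """Count protein-ligand atom pairs closer than min_dist."""
    return sum(
        1
        for p_atom in protein
        for l_atom in ligand
        if math.dist(p_atom, l_atom) < min_dist
    )


# Toy coordinates in angstroms (hypothetical, not taken from SAIR).
protein_atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
plausible_pose = [(2.5, 3.0, 0.0)]   # about 3.9 angstroms from both atoms
clashing_pose = [(0.5, 0.0, 0.0)]    # 0.5 angstroms from the first atom

print(count_clashes(protein_atoms, plausible_pose))
print(count_clashes(protein_atoms, clashing_pose))
```

A pose with zero clashes passes this toy check; a real validation suite combines many such geometric and chemical tests.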

In the end, 97% of SAIR’s structures passed all checks. Beyond PoseBusters validation, we benchmarked key affinity prediction methods, including empirical scoring functions, 3D CNNs, and graph neural networks, on SAIR’s synthetic structures against experimental IC₅₀ values. Detailed results from these studies are available in our manuscript on bioRxiv.

SAIR data is a reliable foundation for downstream modeling, screening, and design, as well as benchmarking new models.

A persistent challenge in drug discovery is the “dark proteome”: disease-associated proteins for which no experimental structure exists. SAIR illuminates these unknown areas by providing reliable AI-predicted complexes even where experimental data is lacking. For example, more than 40 percent of the proteins in the SAIR dataset have no structure at all in the Protein Data Bank (PDB), with or without a ligand. SAIR thereby addresses one of the biggest weaknesses of existing AI models: poor generalization caused by a lack of data. With SAIR, scientists can explore targets previously considered undruggable, using structural hypotheses to guide virtual screening and optimization.

In addition, the breadth of targets covered by SAIR reveals polypharmacology patterns, showing how a single molecule interacts with multiple proteins. These interactions can be used to train AI models that predict off-target effects and identify new repurposing opportunities, giving organizations a deeper understanding of compound profiles before work begins in the lab.

Access to SAIR

SAIR is freely available on Hugging Face. Here’s a quick guide to getting SAIR from Hugging Face, browsing the main table, and (optionally) downloading some of the structure archives.

1. Install the essentials

Use huggingface_hub to retrieve files and pandas with pyarrow to read Parquet.

pip install huggingface_hub pandas pyarrow

2. Authenticate

Log in to Hugging Face:

    import huggingface_hub

    # Replace the placeholder with your own Hugging Face access token.
    huggingface_hub.login(token="your_access_token")

3. Load the main table (sair.parquet).

This will retrieve the file from the hub and load it into a DataFrame.

    from huggingface_hub import hf_hub_download
    import pandas as pd

    # Download sair.parquet from the dataset repo (cached after the first call).
    parquet_path = hf_hub_download(
        repo_id="SandboxAQ/SAIR",
        filename="sair.parquet",
        repo_type="dataset",
    )

    df = pd.read_parquet(parquet_path)
    df.head()

4. (Optional) List available structure archives.

Structure files are shipped as a set of .tar.gz archives under structures_compressed/. List them and choose the ones you need.

    from huggingface_hub import list_repo_files

    # Keep only the archive file names under structures_compressed/.
    files = [
        f.split("/")[-1]
        for f in list_repo_files("SandboxAQ/SAIR", repo_type="dataset")
        if f.startswith("structures_compressed/") and f.endswith(".tar.gz")
    ]
    files[:5]
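Once the archives are listed, you need a way to pick the right one. The sketch below assumes that the two numbers in each archive name (`<start>_to_<end>`) are the inclusive bounds of the entry identifiers stored in that archive; this reading of the naming scheme is an assumption, so confirm it against the dataset README. `find_archive` is a hypothetical helper, and the listing is a stand-in for the names returned by `list_repo_files`.

```python
import re
from typing import Iterable, Optional

# Matches the trailing "<start>_to_<end>.tar.gz" part of an archive name.
RANGE_RE = re.compile(r"_(\d+)_to_(\d+)\.tar\.gz$")


def find_archive(names: Iterable[str], entry_id: int) -> Optional[str]:
    """Return the first archive whose assumed id range contains entry_id."""
    for name in names:
        match = RANGE_RE.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        start, end = int(match.group(1)), int(match.group(2))
        if start <= entry_id <= end:
            return name
    return None


# Stand-in listing; in practice, pass the archive names from the Hub.
archives = [
    "sair_structures_100623_to_111511.tar.gz",
    "sair_structures_1006049_to_1016517.tar.gz",
]
print(find_archive(archives, 1010000))
```

Because each archive is large, resolving the archive first avoids downloading more than you need.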

5. (Optional) Download and extract structure archives

Each archive can be large (approximately 10 GB). Download only what you need and extract it locally.

    import os
    import tarfile
    from huggingface_hub import hf_hub_download

    dest = "sair_structs"
    os.makedirs(dest, exist_ok=True)

    # Archives to fetch; pick names from the listing in step 4.
    to_get = [
        "sair_structures_1006049_to_1016517.tar.gz",
        "sair_structures_100623_to_111511.tar.gz",
    ]

    for name in to_get:
        # Files are written under dest; older huggingface_hub versions may also
        # need local_dir_use_symlinks=False to avoid symlinks into the cache.
        tar_path = hf_hub_download(
            repo_id="SandboxAQ/SAIR",
            filename=f"structures_compressed/{name}",
            repo_type="dataset",
            local_dir=dest,
        )
        # Unpack the archive, then delete it to free disk space.
        with tarfile.open(tar_path, "r:gz") as tar:
            tar.extractall(dest)
        os.remove(tar_path)
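Large downloaded archives are worth extracting defensively. The following is a minimal sketch of one way to do that, not the validation used in the SAIR README: `safe_extract` is a hypothetical helper that rejects any member whose path would resolve outside the destination directory, and it applies the standard library’s "data" extraction filter when the running Python version provides it.

```python
import os
import tarfile
from typing import List


def safe_extract(tar_path: str, dest: str) -> List[str]:
    """Extract a .tar.gz into dest, refusing members that would escape it."""
    os.makedirs(dest, exist_ok=True)
    dest_root = os.path.realpath(dest)
    with tarfile.open(tar_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # This check looks at member names only (e.g. "../evil.txt").
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # The "data" filter adds further checks on Python versions that have it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_root, members=members, **extra)
    return [member.name for member in members]


# Usage (path is illustrative):
# extracted = safe_extract("sair_structs/some_archive.tar.gz", "sair_structs")
```

The function returns the extracted member names, which makes it easy to log what was unpacked or to verify that the expected files are present.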

A complete version of this script, with more robust logging and validation, is available in the README file. To learn more, visit the SAIR homepage, read the manuscript on bioRxiv, or watch our 25-minute joint webinar with NVIDIA, in which we demonstrate SAIR and explain how its data is structured. Extensive documentation, tutorials, and sample benchmarks are provided to ease use and speed internal adoption.

The future of drug discovery is data-driven, AI-accelerated, and based on scalable, high-quality structural insights. While no AI can yet design effective drug therapies from prompts alone, SAIR brings researchers closer to that goal with new data and insights that could shave years off even AI-accelerated R&D pipelines.

We can’t wait to see what researchers build with SAIR. SandboxAQ experts are available to support researchers throughout the discovery process.

Have a question?

Please contact the authors or post on the SAIR dataset discussion page.

Authors: Arman Zaribafyan, Georgia Channing, Zane Beckwith, Rudy Plesch
