Estimate density and score across distributions with one transformer

📄 Technical report: arxiv.org/abs/2511.05924

Many problems in machine learning and science boil down to the same task: we have a collection of data points and want to recover the distribution they came from, that is, which values are common and which are rare. Characterizing that distribution means estimating two quantities. One is the density and the other is the score, and the score becomes more useful as dimensionality increases. The density is a smoothed version of a histogram: higher where points cluster together and lower where they are sparse. The score (the gradient of the log density) points in the direction in which the density increases fastest. Moving points along the score moves them toward more likely regions.
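In symbols, the relationship between the two quantities can be written as follows. The notation here is introduced for illustration only: p is the density, s is the score, and the Gaussian case is a standard textbook example rather than anything specific to this work.

```latex
s(x) \;=\; \nabla_x \log p(x) \;=\; \frac{\nabla_x p(x)}{p(x)},
\qquad
\text{e.g. } p = \mathcal{N}(\mu, \Sigma) \;\Rightarrow\; s(x) = -\Sigma^{-1}(x - \mu).
```

For a Gaussian the score always points back toward the mean, which matches the intuition that following the score leads to more likely regions.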

Diffusion-based generative models (the technology behind AI image generators like Stable Diffusion and DALL-E) start from random noise and turn it into realistic images by iteratively following the score. The same score drives Bayesian sampling and the particle simulations used to model systems such as plasmas.
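To make "following the score" concrete, here is a minimal sketch of unadjusted Langevin dynamics, a simple score-driven sampler. It is not a full diffusion model and is not taken from the technical report. The target distribution, step size, and the function names `gaussian_score` and `langevin_sample` are assumptions chosen for illustration, and the score is known exactly here rather than estimated.

```python
import numpy as np

def gaussian_score(x, mean, std):
    """Exact score of an isotropic Gaussian N(mean, std^2 I): grad_x log p(x)."""
    return -(x - mean) / std**2

def langevin_sample(score_fn, x0, step=0.01, n_steps=2000, seed=0):
    """Unadjusted Langevin dynamics.

    Each iteration moves the particles a small step along the score and adds
    Gaussian noise: x <- x + step * score(x) + sqrt(2 * step) * noise.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x

# Start 5000 particles from wide noise and move them toward N(3, 0.5^2 I) in 2-D.
rng = np.random.default_rng(1)
x0 = 4.0 * rng.standard_normal((5000, 2))
samples = langevin_sample(lambda x: gaussian_score(x, mean=3.0, std=0.5), x0)

print("sample mean:", samples.mean(axis=0))
print("sample std: ", samples.std(axis=0))
```

Swapping `gaussian_score` for an estimated score is where a learned or kernel-based estimator would plug in.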

Extracting density and score from a finite sample is difficult, and today’s tools force a trade-off between generality and accuracy. One classic approach, kernel density estimation (KDE), computes the density at any location from the data points around it: the closer and more numerous the nearby points, the higher the density. It requires no training and applies to any distribution, but its accuracy degrades rapidly as dimensionality increases. Alternatively, neural score-matching models trained to predict scores remain accurate even in high dimensions, but each model learns a single distribution, so a new distribution means retraining from scratch.
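As a reference point, a Gaussian KDE and its score fit in a few lines of NumPy. This is a generic sketch of the classical estimator, not the baseline implementation from the technical report. The function name `kde_log_density_and_score` and the bandwidth value are assumptions.

```python
import numpy as np

def kde_log_density_and_score(queries, data, bandwidth):
    """Gaussian KDE evaluated at `queries` (M, d) using `data` (N, d).

    Returns the log density, shape (M,), and its gradient (the score), shape (M, d).
    """
    n, d = data.shape
    diff = data[None, :, :] - queries[:, None, :]               # (M, N, d)
    logits = -np.sum(diff**2, axis=-1) / (2.0 * bandwidth**2)   # (M, N)

    # Log-sum-exp over the data points, for numerical stability.
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))

    # Normalization: 1/N times the Gaussian constant (2 * pi * h^2)^(-d/2).
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    log_density = lse - log_norm

    # Score of the KDE: kernel-weighted average of (x_i - query) / h^2.
    weights = np.exp(logits - lse[:, None])                     # rows sum to 1
    score = np.einsum("mn,mnd->md", weights, diff) / bandwidth**2
    return log_density, score

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 2))
queries = np.array([[0.0, 0.0], [1.0, -1.0]])
log_p, score = kde_log_density_and_score(queries, data, bandwidth=0.3)
print(np.exp(log_p), score)
```

The single `bandwidth` argument is the fixed, global scale that makes KDE simple, and it is also the quantity that becomes hard to tune as the dimension grows.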

We introduce a new solution called DiScoFormer (Density and Score Transformer). It is a single model that, given a set of data points, estimates both the density and the score of the underlying distribution in one forward pass, without retraining.

Train a transformer for density and score estimation

DiScoFormer uses stacked transformer blocks to map an entire sample to the density and score of the underlying distribution. The model uses cross-attention, so density and score can be evaluated at any query point, not just where the data lies. Score and density are mathematically related: the score is the gradient of the logarithm of the density. We exploit this with a shared backbone and two output heads, one for density and one for score.

This coupling does more than save parameters. Because the score head must match the gradient of the log-density head at every query, any gap between them provides a label-free consistency signal. We use this at inference time: with the context fixed, we take a few gradient steps on the consistency loss. DiScoFormer then adapts to out-of-distribution inputs on the fly, without needing ground-truth densities or scores.
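One plausible way to write such a consistency loss is sketched below. The exact form used in the technical report is not given here, so the squared norm, the averaging, and all symbols are assumptions: X is the fixed context sample, y_1, ..., y_M are query points, and s_θ and p_θ denote the outputs of the score head and density head.

```latex
\mathcal{L}_{\mathrm{cons}}(\theta)
  \;=\; \frac{1}{M} \sum_{j=1}^{M}
  \Bigl\| \, s_\theta(y_j \mid X)
  \;-\; \nabla_{y} \log p_\theta(y \mid X) \big|_{y = y_j} \Bigr\|_2^2
```

Every term can be computed from the model's own outputs, which is why no ground-truth density or score is needed to minimize it.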

There are mathematical reasons why the transformer architecture suits this task. Kernel density estimation has a single bandwidth: how far each point’s influence reaches is fixed in advance and applied equally everywhere. Attention is a strict generalization of this. We show analytically that the weights of a single attention head can approximate Gaussian kernels over the data, so one cross-attention block can already reproduce the KDE density and score. From there the model goes further, learning several such scales at once and adapting them to the data. DiScoFormer is therefore not a black box that abandons the classical approach; it contains KDE as a special case and improves on it.
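One way to see the connection numerically is the sketch below. It is an illustration, not necessarily the construction used in the technical report. Expanding the squared distance |q - k|^2 shows that the |q|^2 term is constant across keys and cancels inside the softmax, so dot-product attention with a per-key bias reproduces normalized Gaussian kernel weights. The function names and the bandwidth `h` are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(queries, keys, h):
    """Dot-product attention with logits q.k / h^2 and key bias -|k|^2 / (2 h^2)."""
    key_bias = -np.sum(keys**2, axis=1)[None, :] / (2.0 * h**2)
    logits = queries @ keys.T / h**2 + key_bias
    return softmax(logits, axis=1)

def gaussian_kernel_weights(queries, data, h):
    """Normalized Gaussian kernel weights K_h(q - x_i) / sum_j K_h(q - x_j)."""
    sq = np.sum((queries[:, None, :] - data[None, :, :])**2, axis=-1)
    k = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2.0 * h**2))
    return k / k.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 5))     # context sample, used as keys and values
queries = rng.standard_normal((10, 5))   # points where we want the estimate
h = 0.7

w_attn = attention_weights(queries, data, h)
w_kde = gaussian_kernel_weights(queries, data, h)
print("max weight difference:", np.abs(w_attn - w_kde).max())

# KDE score recovered from the attention output:
# (attention-weighted mean of the data - query) / h^2.
score_from_attention = (w_attn @ data - queries) / h**2
```

In this picture the bandwidth plays the role of an attention temperature, so several heads with different temperatures correspond to several kernel scales at once.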

What data did we use to train DiScoFormer? We relied on Gaussian mixture models (GMMs) for two main reasons. First, GMMs are universal density approximators: with enough components, a GMM can match essentially any smooth distribution to arbitrarily small error. Second, a GMM has a closed-form density and score, so there is always an exact target to supervise against. We exploit both properties by drawing a new GMM for every batch, which gives the model a virtually unlimited supply of target distributions, and by supervising each one with the exact density and score of that GMM.
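The closed-form targets are easy to write down. The sketch below draws a random isotropic GMM, samples from it, and computes the exact log density and score at query points. The isotropic covariances, the sampling ranges in `sample_random_gmm`, and all function names are assumptions made for illustration; the mixture family and hyperparameters actually used for training are described in the technical report, not here.

```python
import numpy as np

def gmm_log_density_and_score(x, weights, means, stds):
    """Exact log density and score of an isotropic GMM sum_k w_k N(mu_k, sigma_k^2 I).

    x: (M, d), weights: (K,), means: (K, d), stds: (K,).
    """
    d = x.shape[1]
    diff = means[None, :, :] - x[:, None, :]                    # (M, K, d)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * d * np.log(2.0 * np.pi * stds**2)[None, :]
                - np.sum(diff**2, axis=-1) / (2.0 * stds**2)[None, :])
    m = log_comp.max(axis=1, keepdims=True)
    log_density = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

    # Score: responsibility-weighted sum of (mu_k - x) / sigma_k^2.
    resp = np.exp(log_comp - log_density[:, None])              # (M, K), rows sum to 1
    score = np.einsum("mk,mkd->md", resp / stds[None, :]**2, diff)
    return log_density, score

def sample_random_gmm(rng, d, max_components=5):
    """Draw random mixture parameters (hypothetical ranges)."""
    k = int(rng.integers(1, max_components + 1))
    weights = rng.dirichlet(np.ones(k))
    means = rng.normal(scale=3.0, size=(k, d))
    stds = rng.uniform(0.5, 1.5, size=k)
    return weights, means, stds

def draw_samples(rng, n, weights, means, stds):
    """Sample n points: pick a component, then add scaled Gaussian noise."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + stds[comp, None] * noise

# One "batch": a fresh GMM, a context sample, and exact targets at the queries.
rng = np.random.default_rng(0)
weights, means, stds = sample_random_gmm(rng, d=3)
context = draw_samples(rng, 256, weights, means, stds)
queries = draw_samples(rng, 32, weights, means, stds)
target_log_density, target_score = gmm_log_density_and_score(queries, weights, means, stds)
```

Because the parameters are redrawn for each batch, the targets stay exact while the distributions never repeat.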

Performance

Overall, DiScoFormer outperforms KDE in both density and score estimation, and the gap widens in exactly the regimes where KDE struggles. In 100 dimensions it is not even close: compared to the best manually tuned KDE, DiScoFormer reduces score error by about 6.5x and density error by over 37x, and it keeps improving as samples are added while KDE runs out of memory. It also generalizes well beyond its training data, maintaining accuracy on mixtures with more modes than it saw during training and on non-Gaussian shapes such as Laplace and Student’s t. KDE’s main remaining advantage is speed, especially on small datasets.

What we find most exciting about DiScoFormer is that score estimation is a dependency shared across many fields, including generative modeling, Bayesian inference, and scientific computing. A pretrained, plug-in estimator that stays accurate in high dimensions and needs no retraining for each problem can reduce costs in all of them at once. One model is reused wherever scores and densities appear.

For more information, we recommend reading our technical report.

versatileai
