Which tokens does the hybrid model predict better?

📄 Technical report: https://arxiv.org/abs/2606.20936

Which kinds of tokens does a model predict well, and which does it predict poorly? The question is especially interesting for hybrids, which are beginning to challenge standard transformers and are the language model architecture we are exploring with Olmo Hybrid.

Hybrids can match or even outperform transformers on standard benchmarks, but headline numbers alone don’t reveal much about what specific advantages hybrid models have over transformers.

To understand these token-level behaviors, we recently ran an experiment that directly compared our strongest 7B transformer, Olmo 3, with a hybrid model, Olmo Hybrid. Specifically, we compare in detail how the two models’ predictions differ across types of tokens, the units of text that an LLM takes as input.

Olmo 3 and Olmo Hybrid are built to be as similar as possible outside of their architecture, so their data, tokenizers, and training recipes are closely matched, and differences in predictions primarily reflect the architecture itself. By looking at these differences at the token level, we can gain insight into the specific strengths of the hybrid model relative to transformers.

Our results show that the hybrid’s benefits hold for many tokens, but not all. Olmo Hybrid’s advantage is largest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by keeping track of what is going on, such as who a pronoun refers to. But the advantage mostly disappears on tokens that simply repeat what’s already in the input (words or phrases reproduced verbatim from earlier), where the answer is right there and can be looked up. That kind of lookup is the strength of transformers.

Measuring the differences between attention and recurrence

A language model is built from a stack of repeated layers, each of which uses the surrounding tokens to refine the representation of every token.

A transformer uses attention in every layer. With attention, the model can directly access all previous tokens at once and weigh how relevant each is to the current prediction. This allows accurate recall, even when a particular previous token appeared much earlier in the input. The drawback is that every token is compared to every previous token, so the cost of attention grows rapidly as the input gets longer. Furthermore, although attention is great at recalling and aggregating information, it struggles to represent information that evolves sequentially over time.

A hybrid model retains some attention layers but replaces the rest with recurrent layers. Unlike an attention layer, a recurrent layer reads tokens from left to right, maintains a fixed-size memory, and folds each new token into that memory, so the cost of processing each token remains constant as the input gets longer. Because its memory is compressed and lossy, a recurrent layer cannot look back at previous tokens as accurately as attention can. But it is well suited to continuously tracking anything that changes as the model reads the tokens, providing a strength complementary to attention.
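The cost contrast between the two layer types can be made concrete with a toy sketch. This is not Olmo Hybrid's actual layer: `attention_step` and `recurrent_step` are hypothetical names, the "attention" is just an unnormalized dot-product score against a growing cache, and the recurrence is a simple exponential moving average with an assumed decay of 0.9. It only illustrates that attention does work proportional to the number of previous tokens, while a recurrent update touches a fixed-size state.

```python
def attention_step(cache, x):
    """Append x to the cache and score it against every cached token.

    Returns the number of comparisons made, which grows with the cache.
    """
    cache.append(x)
    scores = [sum(a * b for a, b in zip(x, past)) for past in cache]
    return len(scores)


def recurrent_step(state, x, decay=0.9):
    """Fold x into a fixed-size state (toy exponential moving average)."""
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]


tokens = [[float(i), 1.0] for i in range(5)]  # five toy 2-d token vectors

cache, comparisons = [], []
state = [0.0, 0.0]
for x in tokens:
    comparisons.append(attention_step(cache, x))
    state = recurrent_step(state, x)

print(comparisons)  # per-token attention work grows: [1, 2, 3, 4, 5]
print(len(state))   # recurrent memory stays the same size: 2
```

The cache keeps every token exactly, which is what makes verbatim lookup possible; the moving average blends tokens together, which is the lossy compression described above.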

To isolate the strengths and weaknesses of the attention and recurrent layers, we fed Olmo 3 and Olmo Hybrid passages of prose such as articles, Wikipedia entries, books, and scientific papers, as well as structured text such as Python, HTML, and LaTeX. We scored each model on how accurately it predicted each token from the preceding tokens in a given sample.

Both models saw the same preceding tokens and assigned a probability to every possible next token. We recorded the probability each gave to the token that actually followed. We then summarized the difference between the two models token by token by computing the loss gap: the transformer’s loss minus the hybrid’s loss. A positive gap means the hybrid predicted the actual next token better. A negative gap means the transformer did.
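As a minimal sketch of this measurement, the snippet below computes a per-token loss gap from the probability each model gave to the actual next token. It assumes the loss is the usual negative log-likelihood in nats, which the post does not spell out; the function names and the probabilities are made up for illustration.

```python
import math


def token_loss(prob_of_actual_token):
    """Negative log-likelihood (in nats) of the token that actually followed."""
    return -math.log(prob_of_actual_token)


def loss_gaps(transformer_probs, hybrid_probs):
    """Per-token gap: transformer loss minus hybrid loss.

    Positive values mean the hybrid assigned more probability to the
    actual next token; negative values mean the transformer did.
    """
    return [
        token_loss(p_t) - token_loss(p_h)
        for p_t, p_h in zip(transformer_probs, hybrid_probs)
    ]


# Made-up probabilities for three tokens, for illustration only.
transformer_probs = [0.20, 0.50, 0.90]
hybrid_probs = [0.40, 0.50, 0.60]

gaps = loss_gaps(transformer_probs, hybrid_probs)
print([round(g, 3) for g in gaps])  # [0.693, 0.0, -0.405]
```

In this toy case the hybrid wins on the first token, the models tie on the second, and the transformer wins on the third.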

We ran several analyses to find where loss gaps are concentrated. First, we sorted tokens into categories and averaged the loss gap within each category. Because raw means can be distorted by other factors, such as how rare a category is or how often a token is repeated within a sample of text, we rechecked each pattern using regression, which estimates the effect of the category itself while holding other factors constant.
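To see why the regression check matters, here is a small sketch under stated assumptions. The post does not specify the regression form or its full set of covariates, so this uses plain ordinary least squares with two hypothetical predictors (a content-word indicator and a repeat count) on made-up loss gaps. The helper names `solve` and `ols` are illustrative, not from the report.

```python
def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


# Made-up tokens: (is_content, times_seen_before, loss_gap).
# Content words here happen to be repeated more often, which shrinks their gap.
data = [
    (1, 3, 0.03), (1, 2, 0.06), (1, 0, 0.12),
    (0, 0, 0.02), (0, 0, 0.02), (0, 1, -0.01),
]

content = [g for c, _, g in data if c == 1]
function = [g for c, _, g in data if c == 0]
raw_diff = sum(content) / len(content) - sum(function) / len(function)

# Design matrix columns: intercept, is_content, times_seen_before.
beta = ols([[1.0, c, r] for c, r, _ in data], [g for _, _, g in data])

print(round(raw_diff, 3))            # raw content-minus-function gap: 0.06
print([round(b, 3) for b in beta])   # close to [0.02, 0.1, -0.03]
```

In this toy data the raw category means understate the content-word effect (0.06) because content words are also repeated more often; holding the repeat count constant recovers the larger coefficient (0.10).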

What the actual text shows

We found that Olmo Hybrid had lower loss than Olmo 3 on most types of tokens, but the size of the advantage varied across token types.

In prose, the clearest distinction is between content words, such as nouns, verbs, and adjectives, which carry meaning, and function words, such as “the,” “of,” and “is.” The hybrid predicts content words more accurately than the transformer, and the loss gap is larger for content words than for function words.

The hybrid’s advantage is especially pronounced for content-word categories such as adverbs and adjectives, although some function-word categories, such as existential “there,” also show a significant advantage for the hybrid. Overall, the hybrid advantage is largest for words that carry a sentence’s content and smallest for grammatical words that any model can approximately infer from syntax.

In contrast, we find a few specific situations where the hybrid’s advantage over the transformer disappears. The first is closing brackets, but not opening brackets. This pattern is robust across brackets in natural language, code, and markup. Why? Attention is known to be sufficient for representing matched parentheses, which suggests that attention alone is enough to predict closing brackets.

The second place where the hybrid’s advantage almost disappears is when the next token simply repeats something already present in the passage. We identify these cases by searching for repeated n-grams: sequences of tokens in which the token that completes the sequence appeared verbatim earlier in the same passage. The longer the repeated run, the smaller the hybrid’s lead becomes, approaching zero.
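One way to flag such tokens is sketched below. The report's exact matching procedure is not described here, so this is an assumption: for each position, a simple quadratic scan finds the longest n-gram ending at that position that also ends at some earlier position. The function name `repeat_lengths` is hypothetical, and overlapping matches are allowed.

```python
def repeat_lengths(tokens):
    """For each position, length of the longest earlier n-gram it completes.

    result[i] == n means tokens[i-n+1 : i+1] also occurs ending at some
    earlier position j < i. A value of 0 means token i never appeared
    before; 1 means only the token itself did.
    """
    result = []
    for i in range(len(tokens)):
        best = 0
        for j in range(i):  # candidate earlier end positions
            n = 0
            # Walk backwards while the two windows keep matching.
            while j - n >= 0 and tokens[i - n] == tokens[j - n]:
                n += 1
            best = max(best, n)
        result.append(best)
    return result


passage = "the cat sat on the mat and the cat sat".split()
print(repeat_lengths(passage))  # [0, 0, 0, 0, 1, 0, 0, 1, 2, 3]
```

The final "sat" completes a three-token run ("the cat sat") that already appeared, so it gets length 3. Averaging the loss gap by this length is one way to check whether the lead shrinks as the repeated run gets longer.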

Finally, inspired by these findings, we consider using loss filtered to specific types of tokens as an evaluation for comparing architectures in pretraining experiments. We use three 1B-parameter models from previous Olmo Hybrid work: a transformer, a hybrid, and a pure recurrent model with no attention at all.

On non-repeated content tokens, both the hybrid and the pure recurrent model beat the transformer, with the hybrid performing best. On repeated tokens, the pure recurrent model, which has no attention with which to retrieve the copy, lags behind both the hybrid and the transformer.

These filtered token losses thus reveal fine-grained differences between architectures, such as differences in copying ability or in content-word prediction, that would otherwise be invisible during the early stages of training.
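A filtered token loss is simply a mean over a subset of positions. The sketch below assumes per-token losses and a boolean mask are already available; the function name `filtered_loss` is hypothetical, and every number is made up solely to exercise the function, not taken from the report.

```python
def filtered_loss(losses, keep):
    """Mean loss over only the tokens where keep[i] is True."""
    kept = [loss for loss, k in zip(losses, keep) if k]
    if not kept:
        raise ValueError("filter selected no tokens")
    return sum(kept) / len(kept)


# Toy per-token losses for four tokens; the last two repeat earlier text.
is_repeated = [False, False, True, True]
not_repeated = [not r for r in is_repeated]

toy_losses = {
    "transformer": [3.0, 3.0, 0.5, 0.5],
    "recurrent": [2.5, 2.5, 1.0, 1.0],
    "hybrid": [2.4, 2.4, 0.5, 0.5],
}

for name, losses in toy_losses.items():
    overall = sum(losses) / len(losses)
    fresh = filtered_loss(losses, not_repeated)
    copied = filtered_loss(losses, is_repeated)
    print(f"{name}: overall={overall:.2f} non-repeated={fresh:.2f} repeated={copied:.2f}")
```

In this toy setup the transformer and the pure recurrent model have the same overall loss (1.75), yet the two filtered losses separate them in opposite directions, which is the kind of difference a single average hides.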

Where does this leave us?

Figure: Filtered token loss reveals architectural differences during 1B pretraining. Token loss curves at WSD-annealed checkpoints for the transformer, hybrid, and pure recurrent (RNN) models.

Two lessons emerge from this study.

First, a single overall loss (the model’s average error across all tokens) is too coarse to compare transformer and hybrid architectures. Scoring loss only on specific types of tokens, each of which tests a particular model capability, reveals important differences.

Second, we find evidence that hybrid models have a particular advantage on open-class tokens (content words). This is probably related to the state-tracking capability of the recurrent layers.

As a next step, we are incorporating these findings into our ongoing hybrid modeling work. We believe that the best hybrid architectures will come from a token-by-token understanding of what each component of the model does. We hope that research like this will lead to a deeper understanding across the AI community as a whole.

We encourage you to read the full report, explore Olmo 3, try Olmo Hybrid, and explore related open artifacts.

versatileai
