Versa AI hub

Which tokens does the hybrid model predict better?

By versatileai | June 27, 2026

📄 Technical report: https://arxiv.org/abs/2606.20936

Which kinds of tokens does a model predict well, and which does it predict poorly? The question is of particular interest for hybrids, which are beginning to challenge standard transformers and are the language model architecture we are exploring with Olmo Hybrid.

Hybrids can match or even outperform transformers on standard benchmarks, but headline numbers alone reveal little about which specific advantages hybrid models have over transformers.

To understand these token-level behaviors, we recently ran an experiment that directly compared our strongest 7B transformer, Olmo 3, with a hybrid model, Olmo Hybrid. Specifically, we provide a detailed comparison of how the two models' predictions differ across types of tokens, the units of text that an LLM takes as input.

Olmo 3 and Olmo Hybrid are built to be as similar as possible apart from their architecture: their data, tokenizers, and training recipes are closely matched, so differences in predictions primarily reflect the architecture itself. By looking at these differences at the token level, we can gain insight into the specific strengths of the hybrid model relative to the transformer.

Our results show that the hybrid's benefits do exist for many tokens, but not all. Olmo Hybrid's advantage is largest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by keeping track of what is going on, such as who a pronoun refers to. But the advantage mostly disappears on tokens that simply repeat what is already in the input (words or phrases reproduced verbatim from earlier), where the answer is right there and can be looked up. That kind of lookup is the strength of transformers.

Measuring the differences between attention and recurrence

A language model is built from a stack of repeated layers, each layer using the surrounding tokens to refine the representation of every token.

A transformer uses attention in every layer. The model can directly access all previous tokens at once and weigh how relevant each is to the current prediction. This makes attention accurate at recall, even when a relevant token appeared much earlier in the input. The problem is that every token is compared with every previous token, so the cost of attention grows rapidly as the input gets longer. Furthermore, although attention is good at recalling and aggregating information, it struggles to represent information that evolves sequentially over time.

A hybrid model retains some attention layers but replaces the rest with recurrent layers. Unlike an attention layer, a recurrent layer reads tokens from left to right, maintains a fixed-size memory, and folds each new token into that memory, so the cost of processing each token stays constant as the input gets longer. Because its memory is compressed and lossy, a recurrent layer cannot recall previous tokens as accurately as attention can. But it is well suited to continuously tracking anything that changes as the model reads, a strength that complements attention.
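To make the cost contrast concrete, here is a toy sketch in Python. It is not how the layers in Olmo 3 or Olmo Hybrid are implemented: the function names and the exponential-decay update are assumptions chosen only to illustrate quadratic attention work versus a fixed-size, lossy recurrent state.

```python
def attention_work(num_tokens):
    """Count token-to-token comparisons made by causal attention.

    Token i (1-indexed) is compared with itself and every earlier token,
    so the total grows quadratically with the input length.
    """
    return sum(range(1, num_tokens + 1))


def recurrent_scan(values, decay=0.5):
    """Fold a stream of numbers into one fixed-size state, left to right.

    Hypothetical update rule (exponential decay), used for illustration.
    Each step does a constant amount of work. Older inputs are decayed,
    which is why the memory is lossy: they cannot be read back exactly.
    """
    state = 0.0
    steps = 0
    for value in values:
        state = decay * state + (1.0 - decay) * value
        steps += 1
    return state, steps
```

Doubling the input from 4 to 8 tokens raises `attention_work` from 10 to 36 comparisons, while `recurrent_scan` only goes from 4 to 8 steps and keeps a single number as its state throughout.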

To isolate the strengths and weaknesses of attention and recurrent layers, we fed Olmo 3 and Olmo Hybrid passages of text such as articles, Wikipedia entries, books, and scientific papers, as well as structured text such as Python, HTML, and LaTeX. We scored each model on how accurately it predicted each token from the preceding tokens in a given sample.

Both models saw the same preceding tokens and assigned a probability to every possible next token. We recorded the probability each model gave to the token that actually followed. We then summarized the difference between the two models token by token by computing the loss gap, the difference in loss between the two models. A positive gap means the hybrid predicted the actual next token better. A negative gap means the transformer did.

We ran several analyses to find out where loss gaps are concentrated. First, we classified each token into categories and averaged the loss gaps within each category. Because raw means can be distorted by other factors, such as how rare a category is or how often a token is repeated within a sample of text, we rechecked each pattern with a regression, which estimates the effect of the category itself while holding the other factors constant.
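The loss gap can be sketched in a few lines of Python. This assumes the per-token loss is the negative natural log of the probability given to the actual next token, and that the gap is transformer loss minus hybrid loss (so that positive means the hybrid did better, as described above). The function names are illustrative, not taken from the report.

```python
import math


def token_loss(prob_of_actual_token):
    """Negative log-likelihood (in nats) of the token that actually followed."""
    return -math.log(prob_of_actual_token)


def loss_gaps(transformer_probs, hybrid_probs):
    """Per-token loss gap: transformer loss minus hybrid loss.

    Positive values mean the hybrid predicted the actual token better.
    Assumed sign convention, chosen to match the description in the text.
    """
    if len(transformer_probs) != len(hybrid_probs):
        raise ValueError("both models must be scored on the same tokens")
    return [
        token_loss(p_t) - token_loss(p_h)
        for p_t, p_h in zip(transformer_probs, hybrid_probs)
    ]
```

For example, if the transformer gives the actual token probability 0.25 and the hybrid gives it 0.5, the gap for that token is log 2, a positive value in the hybrid's favor.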
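The sketch below illustrates why a raw category mean and a regression coefficient can disagree. Everything in it is an assumption for illustration: the rows are synthetic, the two predictors (`is_content`, `is_repeated`) are a stand-in for whatever controls the report actually uses, and the tiny least-squares solver is only meant for a handful of features.

```python
def mean_gap_by_category(records):
    """Average the loss gap within each category.

    `records` is a list of (category, gap) pairs.
    """
    totals = {}
    for category, gap in records:
        total, count = totals.get(category, (0.0, 0))
        totals[category] = (total + gap, count + 1)
    return {cat: total / count for cat, (total, count) in totals.items()}


def ols(features, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Solved with Gauss-Jordan elimination and partial pivoting.
    """
    k = len(features[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(row[i] * row[j] for row in features) for j in range(k)]
        + [sum(row[i] * y for row, y in zip(features, targets))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Synthetic rows: (is_content, is_repeated, gap). The values are made up.
# Content tokens are repeated more often here, which drags their raw mean down.
rows = (
    [(1, 0, 0.04)] * 1 + [(1, 1, 0.02)] * 3
    + [(0, 0, 0.02)] * 3 + [(0, 1, 0.00)] * 1
)
raw = mean_gap_by_category(
    [("content" if c else "function", gap) for c, _, gap in rows]
)
# Columns: intercept, is_content, is_repeated.
coefs = ols([[1.0, c, r] for c, r, _ in rows], [gap for _, _, gap in rows])
```

In this synthetic data the raw means differ by about 0.01 between content and function tokens, while the regression coefficient on `is_content` is about 0.02, because the regression separates the category effect from the effect of repetition.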

What the actual text shows

We found that Olmo Hybrid had lower loss than Olmo 3 on most types of tokens, but the size of the advantage varied by token type.

In prose, the clearest distinction is between content words, such as nouns, verbs, and adjectives, which carry meaning, and function words, such as “the,” “of,” and “is.” The hybrid predicts content words more accurately than the transformer, with a loss gap of around 0.04, compared with a gap of around 0.02 for function words.

The hybrid's advantage is especially pronounced for content word categories such as adverbs and adjectives, although some function word categories, such as existential words like “there,” also show a substantial hybrid advantage. Overall, the hybrid advantage is largest for words that express the content of a sentence and smallest for grammatical words that any model can approximately infer from syntax.

In contrast, we find a few specific situations where the hybrid's advantage over the transformer is lost. The first is closing brackets, but not opening brackets. This pattern is robust across brackets in natural language, code, and markup. Why? Attention is known to be sufficient to represent bracket matching, which suggests that attention alone is enough to predict closing brackets.

The second place where the hybrid advantage almost disappears is when the next token simply repeats something already present in the passage. We identify these cases by searching for repeated n-grams: sequences of text in which the token that completes the sequence has already appeared verbatim earlier in the same passage. The longer the repeated sequence, the smaller the hybrid's lead becomes, approaching zero.
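One way to flag repeated n-grams is sketched below. It is a minimal illustration under stated assumptions, not the report's procedure: the function name, the `max_n` cap, and the choice to record the longest repeated n-gram ending at each position are all hypothetical.

```python
def repeat_lengths(tokens, max_n=8):
    """Length of the longest n-gram ending at each position that has
    already appeared earlier in the passage (0 if the token is new).

    Lengths are capped at `max_n`.
    """
    # seen[n] holds every n-gram that ended at an earlier position.
    seen = [set() for _ in range(max_n + 1)]
    lengths = []
    for i in range(len(tokens)):
        longest = 0
        limit = min(max_n, i + 1)
        for n in range(1, limit + 1):
            gram = tuple(tokens[i - n + 1 : i + 1])
            if gram in seen[n]:
                longest = n
            else:
                # A longer n-gram cannot repeat if its own suffix did not.
                break
        lengths.append(longest)
        # Only now register the n-grams ending here, so a position
        # never matches against itself.
        for n in range(1, limit + 1):
            seen[n].add(tuple(tokens[i - n + 1 : i + 1]))
    return lengths
```

For the word-level passage `"the cat sat on the cat mat".split()`, this returns `[0, 0, 0, 0, 1, 2, 0]`: the second "the" repeats a 1-gram, and the second "cat" completes the repeated 2-gram "the cat". Averaging the loss gap by these lengths would then show how the gap changes as the repeated sequence gets longer.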

Finally, inspired by these findings, we consider using loss filtered to specific types of tokens as an evaluation for comparing architectures in pretraining experiments. We use three 1B-parameter models from previous Olmo Hybrid work: a transformer, a hybrid, and a pure recurrent model with no attention at all.

On non-repeated content tokens, the hybrid and the pure recurrent model both beat the transformer, with the hybrid performing best. On repeated tokens, the pure recurrent model, which has no attention with which to retrieve the earlier copy, lags behind both the hybrid and the transformer.

These filtered token losses thus reveal fine-grained differences between architectures, such as differences in copying ability or in content word prediction, that would otherwise be invisible during the early stages of training.

Where does this leave us
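A filtered token loss is simply the mean loss over the tokens that pass a filter. The sketch below assumes per-token losses and boolean flags are already available; the function name and the flags used in the usage note are illustrative, not the report's API.

```python
def filtered_loss(token_losses, keep):
    """Mean loss over the tokens whose `keep` flag is True.

    Returns None when the filter selects no tokens.
    """
    if len(token_losses) != len(keep):
        raise ValueError("need exactly one flag per token")
    selected = [loss for loss, flag in zip(token_losses, keep) if flag]
    if not selected:
        return None
    return sum(selected) / len(selected)
```

Given hypothetical per-token flags `is_content` and `is_repeated`, the two evaluations described above would be `filtered_loss(losses, [c and not r for c, r in zip(is_content, is_repeated)])` for non-repeated content tokens and `filtered_loss(losses, is_repeated)` for repeated tokens, each computed per model and per checkpoint.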

Filtered token loss reveals architectural differences during 1B pretraining. Token loss curves at WSD annealed checkpoints for transformer, hybrid, and pure recurrent neural networks (RNNs).

Two lessons emerge from this study.

First, a single overall loss (the model's average error across all tokens) is too coarse to compare transformer and hybrid architectures. Scoring loss on specific token types, each of which tests a particular model capability, reveals important differences.

Second, we find evidence that hybrid models have a particular advantage on open-class tokens, that is, content words. This is probably related to the state-tracking ability of the recurrent layers.

As a next step, we are incorporating these findings into our ongoing hybrid modeling work. We believe that the best hybrid architectures will come from a token-by-token understanding of what each component of the model does. We hope that research like this will lead to a deeper understanding across the AI community as a whole.

We encourage you to read the full report, explore Olmo 3, try Olmo Hybrid, and explore related open artifacts.
