Announcing a new suite of open tools for language model interpretability
Although large language models (LLMs) have impressive reasoning capabilities, their internal decision-making processes remain largely opaque. When a system does not behave as expected, it can be difficult to determine why, because we lack visibility into its internal workings. Last year, we advanced the science of interpretability with Gemma Scope, a toolkit designed to help researchers understand the inner workings of Gemma 2, a lightweight collection of open models.
Today we are releasing Gemma Scope 2: a comprehensive, open suite of interpretability tools covering all Gemma 3 model sizes, from 270M to 27B parameters. These tools let you trace potential risks through the “brain” of your model.
To our knowledge, this is the largest open-source release of interpretability tools by an AI lab to date. Creating Gemma Scope 2 required storing approximately 110 petabytes of data and training tools with over 1 trillion total parameters.
As AI continues to advance, we hope these tools will help the AI research community better understand and oversee increasingly capable models.
You can try out the interactive Gemma Scope 2 demo, courtesy of Neuronpedia.
New features in Gemma Scope 2
Interpretability research aims to understand the inner workings of an AI model and the learned algorithms. As AI becomes increasingly sophisticated and complex, interpretability is critical to building safe and reliable AI.
Like its predecessor, Gemma Scope 2 acts as a microscope for the Gemma family of language models. Its sparse autoencoders (SAEs) and transcoders let researchers look inside a model to see what it is “thinking,” and how those thoughts form and connect to the model’s behavior. This enables richer studies of safety-related AI behaviors, such as jailbreaking and mismatches between a model’s stated reasoning and its internal state.
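To make the SAE idea concrete, here is a minimal numpy sketch of how a sparse autoencoder maps a model activation into sparse, interpretable features and back. This is purely illustrative: the sizes, random weights, and function names are stand-ins, not the actual Gemma Scope 2 implementation (real SAEs are trained so each feature corresponds to a human-interpretable concept).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # toy sizes; real SAEs are far wider than the model dimension
W_enc = rng.normal(0, 0.1, (d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_features(x):
    """Encode a model activation into a sparse feature vector (ReLU keeps only active concepts)."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def sae_reconstruct(f):
    """Decode sparse features back into the model's activation space."""
    return W_dec @ f + b_dec

x = rng.normal(size=d_model)      # stand-in for a residual-stream activation
f = sae_features(x)               # which "concepts" fired on this input
x_hat = sae_reconstruct(f)        # how well those concepts explain the activation

print("active features:", int((f > 0).sum()), "of", d_sae)
print("reconstruction error:", float(np.linalg.norm(x - x_hat)))
```

In a trained SAE, inspecting which features fire (and on which inputs) is what lets researchers read off what the model is representing at a given layer; a transcoder applies the same idea to the computation between layers.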
While the original Gemma Scope enabled research in important safety areas, such as understanding model hallucinations, identifying secret knowledge held by models, and training safer models, Gemma Scope 2 supports even more ambitious research through significant upgrades.
Complete coverage at scale: We offer a complete tool suite for the entire Gemma 3 family (up to 27B parameters). This is essential for studying emergent behaviors that only appear at scale, like those shown by the 27B-based C2S-Scale model, which helped surface a potential new cancer treatment pathway. Gemma Scope 2 was not trained on that model, but it illustrates the kind of emergent behavior these tools could help us understand.

More sophisticated tools for deciphering complex inner workings: Gemma Scope 2 includes SAEs and transcoders trained on all layers of the Gemma 3 family of models. Skip transcoders and cross-layer transcoders make it easier to trace multi-step computations and decipher algorithms spread across the model’s layers.

Advanced training techniques: We use state-of-the-art methods, notably Matryoshka training, which helps the SAEs discover more useful concepts and resolves specific shortcomings found in the original Gemma Scope.

Chatbot behavior analysis tools: We also provide interpretability tools for the versions of Gemma 3 tailored to chat use cases. These let you analyze complex multi-step behaviors such as jailbreaks, refusal mechanisms, and chain-of-thought faithfulness.
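To give a rough intuition for the Matryoshka idea mentioned above, here is a hypothetical numpy sketch of a Matryoshka-style reconstruction loss: the SAE is penalized for reconstruction error using nested prefixes of its feature dictionary, not just the full dictionary. All names and sizes are illustrative assumptions, not the Gemma Scope 2 training code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_sae, d_model))
W_dec = rng.normal(0, 0.1, (d_model, d_sae))

def matryoshka_loss(x, prefix_sizes=(8, 16, 32, 64)):
    """Sum of reconstruction errors over nested prefixes of the dictionary.

    Because the first few features must reconstruct the activation on their
    own, general concepts are pushed into early features instead of being
    absorbed into many narrow, overly specific ones.
    """
    f = np.maximum(0.0, W_enc @ x)            # sparse features (ReLU)
    total = 0.0
    for m in prefix_sizes:
        x_hat = W_dec[:, :m] @ f[:m]          # decode using only the first m features
        total += np.sum((x - x_hat) ** 2)
    return total

x = rng.normal(size=d_model)                  # stand-in for a model activation
print("matryoshka loss:", matryoshka_loss(x))
```

Minimizing this nested objective during training is one published way to address feature-splitting issues in standard SAEs, which is the kind of deficiency the passage above refers to.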