Published July 31, 2024
Author: Language Model Interpretability Team
We present a comprehensive and open suite of sparse autoencoders for increasing the interpretability of language models.
To create artificial intelligence (AI) language models, researchers build systems that learn from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a field of research focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of “microscope” that lets them look inside a language model and better understand how it works.
Today we are announcing Gemma Scope, a new set of tools to help researchers understand the inner workings of Gemma 2, a lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We’re also open sourcing Mishax, the tool we built to enable much of the interpretability work behind Gemma Scope.
We hope that today’s release will enable more ambitious interpretability studies. Further research could help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents, such as deception and manipulation.
Try the interactive Gemma Scope demo brought to you by Neuronpedia.
Interpreting what happens inside a language model
When you ask a language model a question, it turns your text input into a series of “activations.” These activations map relationships between the words you’ve entered, helping the model make connections between different words, which it uses to write an answer.
As the model processes text input, activations at different layers of its neural network represent multiple, increasingly sophisticated concepts, known as “features.”
For example, early layers of the model might learn to recall facts such as Michael Jordan plays basketball, while later layers might learn to recognize more complex concepts, such as whether the text is factual.
A stylized representation of using a sparse autoencoder to interpret a model’s activations as it recalls the fact that the City of Light is Paris. Concepts related to French are present, while unrelated ones are not.
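To make the “activations” described above more concrete, here is a minimal sketch of capturing them with PyTorch forward hooks. It assumes access to the Gemma 2 2B checkpoint on Hugging Face and the standard transformers module layout (model.model.layers[i]); the layer indices are arbitrary.

```python
# A minimal sketch of capturing per-layer activations with PyTorch forward hooks.
# Assumes access to the "google/gemma-2-2b" checkpoint and the standard
# transformers layout for Gemma 2 (model.model.layers[i]); layer indices are arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

captured = {}  # layer index -> activations from the most recent forward pass

def make_hook(layer_idx):
    def hook(module, inputs, outputs):
        # Decoder layers return a tuple; the first element holds the hidden states
        # for every token at this layer.
        captured[layer_idx] = outputs[0].detach()
    return hook

# Register hooks on an early, middle, and late layer to see how representations deepen.
handles = [model.model.layers[i].register_forward_hook(make_hook(i)) for i in (3, 12, 20)]

inputs = tokenizer("The City of Light is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for i, acts in captured.items():
    print(f"layer {i}: activations of shape {tuple(acts.shape)}")  # (batch, tokens, d_model)
# (Call handle.remove() on each handle when the hooks are no longer needed.)
```

These captured vectors are exactly the kind of activations that a sparse autoencoder is trained to decompose.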
However, interpretability researchers face a key problem: a model’s activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that the features in a neural network’s activations would correspond to individual neurons, i.e. nodes of information. But unfortunately, in practice, neurons are active for many unrelated features. This means there is no obvious way to tell which features are part of a given activation.
This is where sparse autoencoders come into play.
Even though a language model can detect perhaps millions or even billions of features, a given activation is only a mixture of a small number of features. That is, the model uses features sparsely. For example, a language model will consider relativity when answering a question about Einstein and eggs when writing about an omelet, but it probably won’t consider relativity when writing about an omelet.
Sparse autoencoders exploit this fact to discover a set of possible features and split each activation into a small number of features. Researchers expect that the best way for sparse autoencoders to accomplish this task is by finding the actual underlying features used by language models.
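As a rough illustration of this decomposition, here is a toy sketch of a sparse autoencoder’s encode and decode steps. The weights are random stand-ins for a trained SAE, and the dimensions (2,304-dimensional activations, 16,384 features) are merely illustrative.

```python
# A toy sketch of how a sparse autoencoder decomposes an activation vector into
# feature strengths and reconstructs it. Weights are random stand-ins for a trained
# SAE; after training with a sparsity penalty, most feature strengths would be zero.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Project the activation onto feature directions; the nonlinearity zeroes
        # features that are treated as absent.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Rebuild the activation as a weighted sum of decoder feature directions.
        return f @ self.W_dec + self.b_dec

sae = SparseAutoencoder(d_model=2304, n_features=16384)
activation = torch.randn(2304)        # one residual-stream activation vector
features = sae.encode(activation)     # strength of each candidate feature
reconstruction = sae.decode(features)

print(f"{(features > 0).sum().item()} of {features.numel()} features active")
print("reconstruction error:", torch.norm(activation - reconstruction).item())
```

A trained SAE is optimized so that the reconstruction stays faithful while only a handful of features are active for any given activation.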
Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder which features to look for. As a result, we are able to discover rich structure that we had not anticipated. However, because the meaning of a discovered feature is not immediately obvious, we look for meaningful patterns in the text examples where the feature “fires.”
Below is an example in which the tokens where the feature fires are highlighted in gradations of blue according to their strength.
An example of feature activations found by a sparse autoencoder. Each bubble is a token (a word or word fragment), and the intensity of the blue color indicates how strongly the feature is present. In this case, the feature is clearly related to idioms.
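A sketch of this inspection workflow, reusing the hypothetical model, tokenizer, captured-activations dict, and toy SAE from the sketches above (the feature index and layer choice are arbitrary): score the feature on each token and print a crude intensity bar in place of the blue highlighting.

```python
# A sketch of checking where a single SAE feature "fires" in a piece of text.
# Reuses the hypothetical model, tokenizer, captured dict, and toy SAE from the
# earlier sketches; the feature index and layer choice are arbitrary.
FEATURE_IDX = 1234
LAYER = 20

text = "To kill two birds with one stone is easier said than done."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**inputs)                          # the layer hooks refill `captured`
acts = captured[LAYER][0]                    # (tokens, d_model)

feature_acts = sae.encode(acts.float())[:, FEATURE_IDX]   # one strength per token
max_act = feature_acts.max().clamp(min=1e-6)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, strength in zip(tokens, feature_acts.tolist()):
    intensity = strength / max_act.item()    # 0 = not firing, 1 = strongest token
    print(f"{tok:>12s}  {'#' * int(round(intensity * 10))}")
```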
What’s unique about Gemma Scope?
Previous work with sparse autoencoders has mainly focused on investigating the inner workings of small models or a single layer of a larger model. More ambitious interpretability research, however, involves decoding layered, complex algorithms in larger models.
To build Gemma Scope, we trained sparse autoencoders on every layer and sublayer output of Gemma 2 2B and 9B, producing more than 400 sparse autoencoders with over 30 million learned features in total (though many features likely overlap). This tool lets researchers study how features evolve throughout the model and how they interact and compose to form more complex features.
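As an illustration of how one might obtain these released SAEs, here is a hedged sketch using the Hugging Face Hub. The repository name below, and the assumption that each SAE is stored as a params.npz file, are guesses to verify against the official Gemma Scope model cards rather than details confirmed by this post.

```python
# A sketch of downloading one of the released Gemma Scope SAEs from the Hugging Face Hub.
# The repository id and file layout are assumptions to check against the official
# model cards; they are not guaranteed by this post.
import numpy as np
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "google/gemma-scope-2b-pt-res"   # assumed: residual-stream SAEs for Gemma 2 2B

# List available SAE checkpoints (assumed to be organized by layer and width) and pick one.
sae_files = [f for f in list_repo_files(repo_id) if f.endswith("params.npz")]
print(f"{len(sae_files)} SAE checkpoints found, e.g. {sae_files[:3]}")

path = hf_hub_download(repo_id=repo_id, filename=sae_files[0])
params = np.load(path)
for name in params.files:
    print(name, params[name].shape)        # encoder/decoder weights, biases, thresholds
```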
Gemma Scope’s SAEs are also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance two goals: detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance, significantly reducing error.
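The post does not spell the architecture out, but the core idea, as described in the JumpReLU SAE paper, is a learned per-feature threshold: pre-activations below the threshold are zeroed (the feature is treated as absent), while those above it pass through at full value. A minimal sketch, with illustrative constant thresholds in place of learned parameters:

```python
# A sketch of the JumpReLU activation: each feature has a threshold; values below it
# are zeroed, values above it are kept unchanged (unlike ReLU, which only zeroes
# negative values). Thresholds here are illustrative constants, not learned parameters.
import torch

def jump_relu(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    return pre_acts * (pre_acts > threshold)

pre_acts = torch.tensor([-0.5, 0.1, 0.4, 2.0])
threshold = torch.full_like(pre_acts, 0.3)

print(torch.relu(pre_acts))             # weak 0.1 signal leaks through: [0.0, 0.1, 0.4, 2.0]
print(jump_relu(pre_acts, threshold))   # below-threshold value is cut:  [0.0, 0.0, 0.4, 2.0]
```

Separating “is the feature present?” (the threshold) from “how strong is it?” (the pass-through value) is what makes it easier to keep reconstructions accurate while staying sparse.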
Training so many sparse autoencoders was a significant engineering challenge that required a lot of computing power. We used approximately 15% of Gemma 2 9B’s training compute (excluding the compute for generating distillation labels), saved approximately 20 pebibytes (PiB) of activations to disk (roughly equivalent to a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
Advancing the field
With the release of Gemma Scope, we want to make Gemma 2 the model family of choice for open mechanistic interpretability studies and accelerate community efforts in this field.
So far, the interpretability community has made great progress in understanding small models with sparse autoencoders and in developing related techniques such as causal interventions, automatic circuit analysis, feature interpretation, and the evaluation of sparse autoencoders. With Gemma Scope, we hope the community will scale these techniques to modern models, analyze more complex capabilities such as chain of thought, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise in larger models.
Acknowledgment
Gemma Scope is a collaboration between Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar, and Neel Nanda, with advice from Rohin Shah and Anca Dragan. Special thanks to Johnny Lin, Joseph Bloom, and Curt Tigges from Neuronpedia for the interactive demo. We would also like to thank Phoebe Kirk, Andrew Forbes, Arielle Bier, Aliya Ahmad, Yotam Doron, Tris Warkentin, Ludovic Peran, Kat Black, Anand Rao, Meg Risdal, Samuel Albanie, Dave Orr, Matt Miller, Alex Turner, Tobi Ijitoye, Shruti Sheth, Jeremy See, Alex Tomala, Javier Ferrando, Oscar Obeso, Kathleen Keneally, Joe Fernandez, Omar Sanseviero, and Glenn Cameron for their support and contributions.