🧠Model: https://huggingface.co/collections/allenai/emo | 📄 Technical report: https://allenai.org/papers/emo | 💻 Code: https://github.com/allenai/EMO | 📊 Visualization: https://emovisualization.netlify.app/

Today, we are releasing EMO, a new end-to-end pre-trained Mixture of Experts (MoE) model in which modular structure emerges directly from the data, without relying on human-defined priors. EMO lets you use a small subset of experts (as little as 12.5% of the total) for a specific task while retaining near-full-model performance, and it remains a strong general-purpose model when all experts are used together.
Large language models are typically trained and deployed as monolithic systems: a single model is initialized, pre-trained, fine-tuned, and served as one unified entity. However, applications often require only a subset of its capabilities, such as code generation, mathematical reasoning, or domain-specific knowledge. Frontier language models routinely reach trillions of parameters, making it impractical for most users to use and adapt a complete model, and incurring unnecessary compute and memory to host parameters that may never be needed.
Mixture of Experts (MoE) models seem like a natural way to relax this constraint. Instead of one large feedforward network at each layer, a MoE contains many smaller feedforward networks called experts, of which only a small subset is activated for each input token. In principle, for a task that exercises only one capability, you could load only the relevant experts.
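To make the mechanism concrete, here is a minimal sketch of a standard top-k MoE layer in PyTorch. The sizes, module structure, and naive dispatch loop are illustrative, not EMO's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """A standard top-k MoE layer: each token independently picks k experts."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # one score per expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (n_tokens, d_model)
        scores = self.router(x)                        # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)     # each token keeps its top k
        weights = F.softmax(weights, dim=-1)           # normalize over the k chosen
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # naive loop; real kernels batch this
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```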
In reality, however, existing MoE models still require the complete set of experts to function properly. Even within a single input, different tokens often activate different experts, so all experts may end up being used over the course of generation. As we show in our paper, part of the reason is that experts in standard MoEs often specialize in low-level lexical patterns, such as prepositions and punctuation, rather than higher-level domains or capabilities. As a result, keeping only a small subset of experts is not reliable.
Instead, we need a MoE whose experts are organized into coherent groups that can be selectively used and composed.
One way to encourage this during pre-training is to route tokens to experts based on predefined semantic domains, such as math, biology, or code. Previous work, such as BTX and our own FlexOlmo, has attempted this. However, predefined domains have important limitations. Domain labels are required for the entire pre-training corpus; they can be ambiguous and expensive to acquire, and they can inject too much human bias into how the model organizes itself. More importantly, predefining the domains also freezes the model's modular structure: when a new domain or capability appears at inference time, it is not obvious which experts should be used.
That’s where EMO comes in.
We show that EMO, a MoE with 1B active and 14B total parameters (8 active experts out of 128 total) trained on 1 trillion tokens, supports selective expert usage. For a given task or domain, a small subset of experts (as little as 12.5% of all experts) can be used while maintaining performance close to the full model. At the same time, EMO remains a strong general-purpose model when all experts are used together. In contrast, a standard MoE with a comparable architecture trained on the same data degrades severely when a subset of its experts is used selectively.

EMO is a MoE trained with modularity as a first-class objective. For a given domain (mathematics, code, biomedicine, etc.), users can select an expert subset of any size and retain near-full-model performance. This turns a single model into a composable one, enabling flexible deployment with improved memory-accuracy trade-offs for large, sparse MoEs.
How can we achieve modularity?
In a MoE, a small network called the router determines which experts each token activates. We want the router to learn that tokens from similar domains should activate a similar subset of experts. Our key observation is that tokens in the same document typically come from the same domain. We therefore use document boundaries as a weak supervision signal: during training, all tokens in a document are restricted to selecting their active experts from a shared pool.

Comparison of standard MoE and EMO training (k = 2, n = 10; shared experts omitted for simplicity). (Left) In a standard MoE, each token independently selects its top-k experts, so expert usage is scattered across the document's tokens and all experts end up used. (Right) In EMO, the router first selects a subset of experts for each document, and all tokens are constrained to route within this subset. This enforces consistent expert usage throughout the document and encourages groups of experts to form areas of specialization.
For example, in a MoE with 10 total experts and 2 active experts per token, as in the diagram above, all tokens in a document are restricted to route within the same pool of 4 experts. This pool is selected by the router itself: we average the router's expert scores across all tokens in the document and select the most-preferred experts as the document's shared pool. Different documents can use different pools, allowing recurring groups of experts to emerge directly from the training data.
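As a rough illustration of this document-level routing, here is a sketch in PyTorch that follows the description above: average the router scores over a document's tokens, keep the top pool_size experts, and mask every token's routing to that pool. The function name and masking approach are our assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def route_document(router_logits: torch.Tensor, k: int = 2, pool_size: int = 4):
    """router_logits: (doc_len, n_experts) router logits for one document."""
    probs = F.softmax(router_logits, dim=-1)
    # 1) Pick the document's shared pool: the experts with the highest
    #    routing probability averaged over all tokens in the document.
    pool = probs.mean(dim=0).topk(pool_size).indices       # (pool_size,)
    # 2) Constrain every token's top-k choice to that pool by masking
    #    out the logits of experts outside the pool.
    mask = torch.full_like(router_logits, float("-inf"))
    mask[:, pool] = 0.0
    weights, experts = (router_logits + mask).topk(k, dim=-1)
    return pool, F.softmax(weights, dim=-1), experts       # per-token top-k within pool

# Usage, matching the figure: 10 experts, k = 2, a pool of 4 per document.
logits = torch.randn(50, 10)  # 50 tokens in one document
pool, weights, experts = route_document(logits, k=2, pool_size=4)
```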
There are several considerations when implementing the system.
Load balancing. One technical challenge is load balancing. Standard MoE training uses a load-balancing objective to prevent the model from concentrating on only a few experts. At first glance, this appears to conflict with EMO's training objective, which explicitly limits each document to a subset of experts.
The tension comes from the scale at which load balancing is typically applied. In many MoE implementations, the load-balancing loss is computed locally, often within microbatches containing only a handful of documents. This local objective pushes tokens to spread across many experts within the same document, directly contradicting EMO's goal of consistent expert usage within a document.
To resolve this, we apply load balancing globally, across many documents. At this scale, the two objectives are complementary: EMO encourages tokens within the same document to use a consistent pool of experts, while global load balancing encourages different documents to collectively cover all experts. In practice, we found that global load balancing is important for stable training.
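For illustration, here is a sketch of the standard Switch-style auxiliary load-balancing loss computed over router statistics pooled across many documents (e.g., all-gathered across a global batch). The post does not specify EMO's exact formulation, so treat this as an assumption about what "global" amounts to in code.

```python
import torch

def global_load_balance_loss(probs: torch.Tensor, expert_ids: torch.Tensor,
                             n_experts: int) -> torch.Tensor:
    """probs: (n_tokens, n_experts) router probabilities pooled across many
    documents; expert_ids: (n_tokens, k) experts selected for each token."""
    # Fraction of token-slots dispatched to each expert.
    counts = torch.zeros(n_experts, device=probs.device)
    counts.scatter_add_(0, expert_ids.flatten(),
                        torch.ones(expert_ids.numel(), device=probs.device))
    load = counts / counts.sum()
    # Mean router probability assigned to each expert.
    importance = probs.mean(dim=0)
    # Minimized when both distributions are uniform over experts.
    return n_experts * (load * importance).sum()
```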
Document pool size. The size of the document pool controls how restrictive the modularity constraint is. A smaller pool forces tokens within the same document to share a tighter set of experts, promoting stronger modularity; a larger pool gives the model more flexibility but a weaker constraint.
Rather than fixing a single pool size, we sample it randomly during training. This prevents EMO from overfitting to one subset size and allows it to support different expert subset sizes at inference.
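A minimal sketch of this sampling step is below; the candidate sizes are hypothetical, since the post does not list the actual set used.

```python
import random

# Hypothetical candidate pool sizes; the post does not give the actual set.
POOL_SIZE_CHOICES = [4, 8, 16, 32]

def sample_pool_size() -> int:
    """Draw the per-document expert pool size for a training step."""
    return random.choice(POOL_SIZE_CHOICES)
```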
Benchmark results
On general benchmarks, EMO matches the performance of a standard MoE, indicating that the modularity objective does not come at the expense of full-model performance. The more important question is whether the model still works when you retain only a portion of the experts. In this setup, we build a task-specific expert subset by ranking experts according to their routing usage on a small amount of task validation data, keeping the most frequently used experts, and discarding the rest.
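The selection procedure can be sketched as follows, assuming access to per-token expert assignments; the forward_with_routing API here is hypothetical, not part of the released code.

```python
import torch

@torch.no_grad()
def select_experts(model, val_batches, keep_fraction=0.125, n_experts=128):
    """Rank experts by routing usage on validation data; keep the top fraction."""
    usage = torch.zeros(n_experts)
    for batch in val_batches:
        # Hypothetical API returning per-token expert assignments, shape (n_tokens, k).
        _, expert_ids = model.forward_with_routing(batch)
        usage += torch.bincount(expert_ids.flatten(), minlength=n_experts).float()
    n_keep = max(1, int(keep_fraction * n_experts))
    return usage.topk(n_keep).indices  # experts to retain; the rest are discarded
```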
The figure below shows that EMO remains robust even under selective expert usage. Retaining only 25% of the experts (a subset of 32) costs EMO only about 1% in absolute performance across benchmarks. Even retaining just 12.5% of the experts (a subset of 16), the overall drop is only about 3%. This holds both before and after fine-tuning. In contrast, the matched standard MoE degrades rapidly as the expert subset shrinks, often approaching or falling below random performance at the smallest subset sizes.

Moreover, we find that choosing the right experts for a task is surprisingly cheap: a single few-shot demonstration example is enough to identify a subset that performs as well as one selected using the complete validation set. EMO is also not tied to any particular selection method; it works well with existing expert-pruning approaches such as Easy-EP, and the two are complementary.

Smaller-scale 130B-token setting. Average performance across 16 MMLU categories at different memory budgets. EMO expert subsets push out the Pareto frontier of the memory-accuracy trade-off, outperforming standard MoEs and even fixed-budget models trained from scratch.
What does the expert subset specialize in?
To see what EMO actually learned, we clustered router activations for the first 100 tokens of 12,000 pre-training documents. The difference from the standard MoE is stark.
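As a rough sketch of this kind of analysis, one could cluster per-token routing distributions with k-means; the library and cluster count below are stand-ins, not necessarily what the report uses.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_tokens(router_probs: np.ndarray, n_clusters: int = 50):
    """router_probs: (n_tokens, n_experts) routing distributions for the
    first 100 tokens of each document, stacked across documents."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(router_probs)  # one cluster id per token
    return labels, km.cluster_centers_
```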
EMO's token clusters cover health, medicine, and wellness; news reporting; US politics and elections; movies and music; and more. The standard MoE instead produces clusters for prepositions, proper names, copular verbs, definite articles, and so on. In EMO, tokens from a given document mostly land in the same cluster; in the standard MoE, they are spread across many clusters.
The contrast is easiest to see with a single example. Take an article about health. In EMO, almost all of its tokens are routed to the health, medicine, and wellness cluster. In the standard MoE, the top clusters are possessives and definite articles: the model groups the article with every other text that happens to use the words "the" or "your", regardless of content.

Token clusters on pre-training data for MoEs trained on 1T tokens. EMO's clusters correspond to semantically meaningful domains, with tokens from the same document largely grouped together. Standard MoE training produces clusters of surface-level lexical or syntactic features, with a document's tokens scattered across many clusters.
Because EMO forms modules that map to semantic domains rather than surface-level features, selecting a small subset of experts still yields a working model: the groups correspond to actual capabilities.
You can explore the clustering results yourself in the interactive visualization.
What we release
We are releasing the fully trained EMO model, a matched standard MoE baseline trained on the same data, and our training code. We hope these artifacts will be useful to other groups studying emergent modularity in MoE models.
There is still work to be done. EMO is an early step toward making large sparse models more modular, and many questions remain: how to better select and compose expert subsets, how to update modules without disrupting the rest of the model, and how to use modular structure to improve interpretability and control. We hope that releasing these models helps the community explore these questions and build toward modular language models that are easy to deploy, adapt, inspect, and compose.

