Ai2's EMO trains a mixture of experts you can run at one-eighth size
The Allen Institute for AI's EMO writeup tackles a long-standing weakness of mixture-of-experts models: turn off most of the experts and quality usually falls apart. EMO is a 14-billion-parameter MoE with 1 billion active parameters, 128 experts in total and 8 active, trained on a trillion tokens. The point is not the size but that you can run a 16-expert subset, one-eighth of the full model, and lose only about 3 percent of performance, or a 32-expert subset for a loss of roughly 1 percent.
The trick is in how it routes. Instead of choosing experts token by token, EMO averages the router's preferences across a whole document and sends every token in that document through the same shared pool. Different documents land on different pools, so coherent expert groups emerge on their own, without anyone labeling domains. Pool sizes are sampled randomly during training so the model learns to work at different subset sizes, and load balancing is applied globally across many documents rather than locally, which keeps the balancing objective from fighting the modularity. Picking the right experts for a task needs almost no data: selection from a single few-shot example matches selection tuned on a full validation set. The weights and code are open.
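The routing idea is simple enough to sketch. Below is a minimal, illustrative version in Python, assuming a standard softmax router whose scores are averaged over a document's tokens; the function names, tensor shapes, and the set of candidate pool sizes are assumptions for illustration, not EMO's released code.

```python
import torch

def route_document(router_logits: torch.Tensor, pool_size: int) -> torch.Tensor:
    """Pick one shared expert pool for an entire document.

    router_logits: [num_tokens, num_experts] per-token router scores.
    pool_size:     number of experts the document may use.
    """
    # Average the router's preferences over every token in the document
    # instead of routing each token independently.
    doc_prefs = router_logits.softmax(dim=-1).mean(dim=0)   # [num_experts]
    # Every token in the document shares the same top-`pool_size` experts.
    return doc_prefs.topk(pool_size).indices                # [pool_size]

def sample_pool_size(sizes=(8, 16, 32, 64, 128)) -> int:
    # During training, the pool size is drawn at random so the model learns
    # to cope with several subset sizes (the exact distribution is assumed).
    return sizes[torch.randint(len(sizes), (1,)).item()]

# Illustrative usage: a 512-token document routed against 128 experts.
logits = torch.randn(512, 128)
pool = route_document(logits, sample_pool_size())
# Tokens are then routed only among the experts in `pool`; shrinking the
# model at inference amounts to capping pool_size (e.g. 16 for ~1/8 size).
```

The same averaging is presumably what makes task-specific selection so cheap: score the tokens of a single example and keep the highest-ranked experts.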
Why it matters
If you serve models under tight memory or cost limits, a MoE that degrades gracefully when shrunk lets you trade quality for size per workload instead of shipping separate models.