Research Tools & Code · arXiv cs.LG · 4d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE targets a scaling bottleneck in mixture-of-experts (MoE) architectures: it enables hyperparameter transfer across dense and sparse MoE configurations. Prior methods such as μP and SDE break down when model topology or token distribution shifts, which forces practitioners to retune from scratch at each scale. The framework's two-bridge approach decouples architecture changes from optimization dynamics, so transfer rules carry across scale changes spanning orders of magnitude. For teams training large MoE models, that means fewer experimentation cycles and less compute wasted during architecture exploration.

Modelwire context

Explainer

The buried detail is that the 'two-bridge' structure is a necessity rather than a convenience: architecture changes and token-routing changes break hyperparameter stability through distinct mechanisms, and prior frameworks like μP addressed only one axis. Treating them as separable problems is the actual technical contribution.
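For readers unfamiliar with what "hyperparameter transfer" means in practice, here is a minimal sketch of the kind of single-axis rule μP is known for: an Adam learning rate tuned on a narrow proxy model is rescaled in inverse proportion to hidden width. This is not Complete-muE's rule set, which the summary above does not spell out. The function name `mup_hidden_lr` and the example widths are hypothetical, and the sketch covers hidden (matrix-shaped) weights only.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Rescale an Adam learning rate tuned at base_width for a hidden layer of a new width.

    Assumption: the standard muP-style rule for hidden weights under Adam,
    where the learning rate shrinks in proportion to 1 / width.
    This is an illustration of width transfer only, not the Complete-muE method.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


if __name__ == "__main__":
    # Hypothetical proxy model: learning rate tuned once at width 256.
    tuned_lr, proxy_width = 3e-3, 256
    for target_width in (256, 1024, 4096):
        print(target_width, mup_hidden_lr(tuned_lr, proxy_width, target_width))
```

A rule of this shape handles width alone. It says nothing about how many experts a token is routed to, which is the second axis the Explainer describes prior frameworks as leaving unaddressed.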

This pairs interestingly with the Shannon-theoretic scaling piece published the same day ('LLMs as Noisy Channels'), which argues that conventional power-law scaling assumptions obscure real capacity ceilings tied to signal-to-noise dynamics. Complete-muE operates at a different layer, the optimization and architecture search layer rather than the capacity modeling layer, but both papers are pushing against the same practical failure mode: compute wasted because practitioners lack principled tools to predict behavior before committing to a full training run. Together they sketch a picture of a field trying to make large-scale training less empirically chaotic and more theoretically grounded.

The real test is whether a major lab publishes MoE training results that explicitly credit Complete-muE transfer rules within the next six months. Adoption in a production training report would confirm the framework holds outside controlled ablations; silence would suggest the transfer guarantees degrade under real-world routing noise.

Coverage we drew on

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Complete-muE · μP · SDE · Mixture-of-Experts · transformer
