
HORST: Composing Optimizer Geometries for Sparse Transformer Training
Transformer sparsification has hit a fundamental wall: standard optimizers cannot simultaneously push a model toward sparsity and keep training stable. Adaptive methods naturally favor an L-infinity geometry, which promotes stability, while sparsity demands an L-1 bias. HORST resolves this tension by treating optimizer steps as non-commutative operators and composing them, using hyperbolic mirror maps to inject sparsity pressure without sacrificing convergence. The result is a modular optimizer that works across vision and language tasks. For practitioners scaling transformers, this addresses a real bottleneck in efficient model deployment and bridges the gap between theoretical sparsity and practical training robustness.
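To make the idea of composing non-commutative optimizer steps concrete, here is a minimal NumPy sketch. It is not the HORST algorithm, whose details are not given in this summary. It assumes two illustrative steps: a sign step (steepest descent under the L-infinity norm) and a mirror descent step using the hypentropy mirror map, whose gradient is arcsinh(w / beta), as one example of a hyperbolic mirror map. The function names, the learning rate, the value of beta, and the reuse of one gradient for both steps are all assumptions made for illustration.

```python
import numpy as np

def sign_step(w, g, lr=0.1):
    # Steepest descent under the L-infinity norm: move every coordinate
    # by the same magnitude, using only the sign of the gradient.
    return w - lr * np.sign(g)

def hyperbolic_mirror_step(w, g, lr=0.1, beta=0.5):
    # Mirror descent with the hypentropy mirror map (an assumed choice):
    # map to the dual space with arcsinh, take a gradient step there,
    # and map back with sinh. Small beta gives more multiplicative,
    # sparsity-leaning updates; large beta approaches plain gradient descent.
    theta = np.arcsinh(w / beta)
    return beta * np.sinh(theta - lr * g)

def compose(first, second):
    # Apply `first`, then `second`. For simplicity both steps see the
    # same gradient g; a real optimizer might recompute or transform it.
    def step(w, g):
        return second(first(w, g), g)
    return step

# Hypothetical weights and gradient, chosen only for the demonstration.
w = np.array([1.0, -0.5, 0.01])
g = np.array([0.3, -0.2, 0.05])

sign_then_mirror = compose(sign_step, hyperbolic_mirror_step)
mirror_then_sign = compose(hyperbolic_mirror_step, sign_step)

a = sign_then_mirror(w, g)
b = mirror_then_sign(w, g)

print("sign then mirror:", a)
print("mirror then sign:", b)
print("orders agree:", np.allclose(a, b))
```

The two orderings produce different weights because the mirror step is nonlinear in w, so the order in which the geometries are applied becomes a design choice rather than a detail.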