
AdaSplash-2: Faster Differentiable Sparse Attention
AdaSplash-2 accelerates differentiable sparse attention for transformers by using histogram-based initialization to compute the normalizer in 1–2 iterations instead of many, reducing computational overhead while maintaining input-dependent sparsity for long-context training.
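
To make the idea of a histogram-initialized normalizer concrete, here is a minimal sketch in plain Python. It is not the AdaSplash-2 kernel. The summary does not specify the sparse mapping, so the sketch assumes sparsemax as a stand-in, where the normalizer is the threshold tau satisfying sum(max(s - tau, 0)) = 1. The function name, the bin count, and the Newton refinement are all illustrative assumptions.

```python
def sparsemax_threshold_hist(scores, num_bins=32, max_steps=50, tol=1e-12):
    """Return (tau, steps) such that sum(max(s - tau, 0)) is 1 (sparsemax).

    Hypothetical sketch: a histogram pass picks a starting tau, then Newton
    steps refine it. `steps` counts the refinement updates that were needed.
    """
    hi = max(scores)
    lo = hi - 1.0                # for sparsemax, tau lies in [max - 1, max)
    width = (hi - lo) / num_bins

    # One pass over the scores: per-bin counts and sums for scores above lo.
    counts = [0] * num_bins
    sums = [0.0] * num_bins
    for s in scores:
        if s > lo:
            b = min(int((s - lo) / width), num_bins - 1)
            counts[b] += 1
            sums[b] += s

    # Walk bins from the top. At a bin's lower edge, the constraint value is
    # f(edge) = (sum of scores above edge) - edge * (count above edge).
    # Start from the highest edge where f(edge) >= 1, i.e. just left of the root.
    tau, count_above, sum_above = lo, 0, 0.0
    for b in range(num_bins - 1, -1, -1):
        count_above += counts[b]
        sum_above += sums[b]
        edge = lo + b * width
        if sum_above - edge * count_above >= 1.0:
            tau = edge
            break

    # Newton refinement. f is piecewise linear, so each step solves the
    # constraint exactly for the current support {s : s > tau}.
    steps = 0
    while steps < max_steps:
        support = [s for s in scores if s > tau]
        new_tau = (sum(support) - 1.0) / len(support)
        if abs(new_tau - tau) <= tol:
            break
        tau = new_tau
        steps += 1
    return tau, steps


def sparsemax(scores):
    """Sparse attention weights for one row of scores (assumed mapping)."""
    tau, _ = sparsemax_threshold_hist(scores)
    return [max(s - tau, 0.0) for s in scores]
```

For example, `sparsemax([2.0, 1.5, 0.1, -1.0])` keeps only the two largest scores and assigns exactly zero weight to the rest, which is the input-dependent sparsity the summary refers to. In this sketch the histogram costs one extra pass over the scores, and the starting point it yields is within one bin width of the root, so few refinement steps remain.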




























