
Toward Identifiable Sparse Autoencoders
Sparse autoencoders have become central to neural network interpretability work, but a fundamental problem has limited their reliability: training instability causes different runs to produce incompatible concept dictionaries and sparse codes. This paper identifies the architectural and procedural sources of that instability and proposes identifiable SAEs (iSAE), a TopK variant that reduces reconstruction error while improving reproducibility across training runs. The advance matters because interpretability tools that produce inconsistent outputs undermine trust in mechanistic explanations of model behavior, a growing concern as SAEs see wider adoption in safety and alignment research.
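For readers unfamiliar with the TopK family that iSAE belongs to, the sketch below shows the forward pass of a generic TopK sparse autoencoder in plain Python. It is not the paper's iSAE, whose specific changes are not described here. The function name `topk_sae_forward`, the parameter names, the tied initialization, and the toy dimensions are all assumptions made for illustration.

```python
import random


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic TopK SAE forward pass (illustrative; not the paper's iSAE).

    x:     input activation, list of d floats
    W_enc: encoder weights, n_latents rows of d floats
    W_dec: decoder dictionary, n_latents rows of d floats (one atom per row)
    """
    # Encoder pre-activations, computed on the input centered by the decoder bias.
    pre = [
        sum(w * (xi - bd) for w, xi, bd in zip(row, x, b_dec)) + b
        for row, b in zip(W_enc, b_enc)
    ]
    # Keep only the k largest pre-activations; every other latent is set to zero.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    code = [0.0] * len(pre)
    for i in keep:
        code[i] = max(pre[i], 0.0)  # ReLU on the kept entries
    # Reconstruction: sparse combination of dictionary atoms plus the decoder bias.
    x_hat = [
        b_dec[j] + sum(code[i] * W_dec[i][j] for i in keep)
        for j in range(len(x))
    ]
    return code, x_hat


# Toy example with hypothetical sizes: 4-dim input, 8 latents, 2 active per input.
random.seed(0)
d, n_latents, k = 4, 8, 2
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_latents)]
W_dec = [row[:] for row in W_enc]  # tied initialization (an assumption)
b_enc = [0.0] * n_latents
b_dec = [0.0] * d
x = [random.gauss(0, 1) for _ in range(d)]

code, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
sq_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

Here `code` is the sparse code and the rows of `W_dec` are the concept dictionary. Two training runs with different seeds would start from different `W_enc` and `W_dec`, and the instability described above concerns whether they end at comparable dictionaries and codes.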
