
MoRFI: Monotonic Sparse Autoencoder Feature Identification
Researchers have identified specific latent directions within fine-tuned LLMs that causally drive hallucinations when the models are trained on new factual knowledge. Using controlled experiments across Llama 3.1, Gemma 2, and Mistral, the team isolated how supervised fine-tuning introduces factual errors even as it improves task performance. The finding matters because it connects the observation that fine-tuned models hallucinate to a mechanistic account of why, which could enable targeted interventions during post-training rather than broad architectural changes. For practitioners deploying fine-tuned models in production, the work suggests that hallucinations are not an inevitable side effect of fine-tuning but an addressable phenomenon tied to specific learned features.
