
The Attentional White Bear Effect in Transformer Language Models
Researchers have uncovered a critical vulnerability in how transformer models handle content suppression: instruction-based filtering successfully prevents prohibited outputs at the surface level, but the underlying concepts remain fully encoded in hidden representations and continue steering model behavior. Using representational probing and attention analysis across multiple architectures, the team demonstrated that suppressed ideas measurably influence downstream generation despite lexical compliance. This finding exposes a fundamental misalignment between behavioral safety measures and actual internal model state, suggesting current suppression techniques create an illusion of control rather than genuine alignment. The persistence across different pooling strategies and model families indicates the problem is structural, not a quirk of specific implementations.
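
To make the probing idea concrete, the following is a minimal sketch of a linear probe applied to pooled hidden states. It is not the researchers' pipeline: the hidden states are synthetic, the "concept direction" is planted by hand, and every name here (`pool`, `fit_linear_probe`, `make_synthetic_states`, `run_probe`) is hypothetical. It only illustrates the general technique of checking whether a concept is linearly decodable from internal representations under more than one pooling strategy, with a zero-strength control for comparison.

```python
import numpy as np


def pool(hidden, strategy="mean"):
    """Collapse (batch, seq_len, dim) hidden states to (batch, dim)."""
    if strategy == "mean":
        return hidden.mean(axis=1)      # average over token positions
    if strategy == "last":
        return hidden[:, -1, :]         # keep only the final token
    raise ValueError(f"unknown pooling strategy: {strategy}")


def fit_linear_probe(x, y, lr=0.5, steps=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


def make_synthetic_states(n=2000, seq_len=8, dim=16, strength=2.0, seed=0):
    """ASSUMPTION: stand-in for real model activations.

    Label 1 means "the suppressed concept is present"; those examples get a
    fixed direction added at every token position. strength=0 is a control
    in which the concept is not encoded at all.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    hidden = rng.normal(size=(n, seq_len, dim))
    hidden += strength * labels[:, None, None] * direction
    return hidden, labels


def run_probe(strategy, strength, n_train=1500):
    """Train on the first n_train examples, report held-out accuracy."""
    hidden, labels = make_synthetic_states(strength=strength)
    x = pool(hidden, strategy)
    w, b = fit_linear_probe(x[:n_train], labels[:n_train])
    return probe_accuracy(x[n_train:], labels[n_train:], w, b)


if __name__ == "__main__":
    for strategy in ("mean", "last"):
        print(strategy, "planted concept:", run_probe(strategy, strength=2.0))
    print("mean, control:", run_probe("mean", strength=0.0))
```

In a real study the synthetic array would be replaced by activations extracted from a model under suppression instructions, and held-out probe accuracy well above the control would be the evidence that the concept is still encoded. Probe accuracy alone does not show that the representation steers generation; that claim needs the attention and downstream-behavior analyses the summary mentions.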
