Vision-Language Models Suppress Female Representations Under Ambiguous Input

Researchers have identified a critical gap in vision-language model alignment: while these systems suppress demographic bias when gender is explicit, they revert to male defaults on ambiguous inputs, even for female-stereotyped roles. The work introduces LALS (Latent Association Leaning Score), a diagnostic tool that maps internal token activations to text embeddings, revealing that biased outputs reflect what the model has actually encoded rather than surface-level artifacts. This finding matters because real-world imagery is often ambiguous, which suggests that current alignment techniques mask rather than resolve underlying gender associations. The technique opens a new avenue for auditing what models actually learn versus what they are trained to say.
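As a rough illustration of comparing an internal activation against text embeddings, the sketch below scores one activation vector by how much closer it sits to a "male" anchor embedding than to a "female" one. The summary does not give the LALS formula, so the function name `leaning_score`, the cosine-difference definition, and the toy vectors are all assumptions made for illustration, not the paper's method. The sketch also assumes the activation has already been projected into the same space as the text embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)

def leaning_score(activation, male_anchor, female_anchor):
    """Hypothetical score (not the paper's definition):
    positive leans male, negative leans female."""
    return cosine(activation, male_anchor) - cosine(activation, female_anchor)

# Made-up 3-d vectors standing in for real embeddings.
male_anchor = [1.0, 0.0, 0.0]
female_anchor = [0.0, 1.0, 0.0]
ambiguous_nurse = [0.8, 0.6, 0.0]  # activation for an ambiguous image

score = leaning_score(ambiguous_nurse, male_anchor, female_anchor)
print(f"leaning score: {score:.3f}")
```

With these toy values the score comes out positive (about 0.2), meaning the activation sits closer to the male anchor even though nothing in the example output would have to say so.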
Modelwire context
The more pointed finding here is not that VLMs are biased, which is well-documented, but that standard alignment techniques appear to suppress bias conditionally: they work when the model is explicitly cued to apply them, and fail silently when no such cue exists. That conditional suppression is arguably worse than consistent bias, because it creates a false sense of safety in evaluation settings that don't reflect deployment conditions.
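One way to make the conditional-suppression concern concrete is to report the male-default rate separately for explicit and ambiguous inputs instead of pooling them. The sketch below does this on made-up records; the record layout, the labels, and the function `male_rate_by_condition` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def male_rate_by_condition(records):
    """Fraction of outputs labeled 'male' within each input condition."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [male count, total]
    for condition, output_gender in records:
        counts[condition][1] += 1
        if output_gender == "male":
            counts[condition][0] += 1
    return {c: male / total for c, (male, total) in counts.items()}

# Made-up records: (input condition, gender expressed in the model's output).
records = [
    ("explicit", "female"), ("explicit", "male"),
    ("explicit", "female"), ("explicit", "male"),
    ("ambiguous", "male"), ("ambiguous", "male"),
    ("ambiguous", "male"), ("ambiguous", "female"),
]

rates = male_rate_by_condition(records)
gap = rates["ambiguous"] - rates["explicit"]
print(rates, gap)
```

An evaluation that only sampled the explicit condition would see a balanced rate here and miss the gap entirely, which is the false sense of safety described above.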
The interpretability angle connects directly to 'What Am I Missing? Question-Answering as Hidden State Probing' from the same day, which also argues that internal model states encode information that surface outputs conceal. Both papers are pushing toward the same methodological claim: you cannot trust what a model says about itself, you have to probe what it has actually encoded. That framing also rhymes with the hate speech paper ('Disagreeing Rationales'), which found that evaluation metrics can hide whether models learn robust reasoning or just pattern-match to majority-vote outputs. Taken together, these three papers from the same week suggest a growing consensus that output-level evaluation is structurally insufficient for alignment work.
Watch whether LALS gets adopted as an audit tool in any major model card or third-party evaluation suite within the next six months. If it does, that would confirm the field is moving toward activation-level auditing as a standard rather than a research curiosity. If it stays confined to citations, the methodology may be sound but too costly to operationalize at scale.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Vision-Language Models · LALS (Latent Association Leaning Score)
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.