AMEL: Accumulated Message Effects on LLM Judgments

Researchers tested whether LLM evaluators drift toward the sentiment of prior conversation history and found statistically significant bias across 11 models from four major providers. When models reviewed identical items after exposure to predominantly positive or negative evaluation sequences, their judgments shifted measurably toward that polarity, with the strongest effect on genuinely ambiguous cases. The finding matters for production deployments where LLMs batch-score code, content, or outputs in sequence: it exposes a systematic vulnerability in automated evaluation pipelines that organizations currently treat as objective.
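The paired setup described above (identical items, opposite-polarity histories) can be sketched as a small measurement harness. This is a minimal illustration, not the study's actual protocol: the `Judge` interface, the `"pass"`/`"fail"` verdict labels, the 0-to-1 score scale, and `toy_judge` are all assumptions made for the example, and a real run would replace `toy_judge` with a call to the model under test.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the study):
# takes prior (item, verdict) messages plus a target item, returns a score in [0, 1].
History = Sequence[Tuple[str, str]]
Judge = Callable[[History, str], float]


def polarity_shift(judge: Judge, targets: Sequence[str],
                   positive_history: History, negative_history: History) -> float:
    """Mean score difference for identical targets under positive vs. negative history."""
    diffs: List[float] = [
        judge(positive_history, t) - judge(negative_history, t) for t in targets
    ]
    return mean(diffs)


def toy_judge(history: History, item: str) -> float:
    """Stand-in for a real model call: a neutral base score nudged by
    the share of positive verdicts in the prior messages."""
    base = 0.5
    if not history:
        return base
    positive_share = sum(1 for _, verdict in history if verdict == "pass") / len(history)
    return base + 0.2 * (positive_share - 0.5)


if __name__ == "__main__":
    pos = [(f"item {i}", "pass") for i in range(8)]
    neg = [(f"item {i}", "fail") for i in range(8)]
    shift = polarity_shift(toy_judge, ["ambiguous A", "ambiguous B"], pos, neg)
    print(f"mean shift toward prior polarity: {shift:.3f}")
```

A shift near zero would indicate no measurable drift for that judge; a consistently positive value means scores move toward the polarity of the history.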

Modelwire context


The research isolates a specific mechanism worth naming. This is not general inconsistency or randomness in LLM judgment; it is directional drift tied to prior polarity, which means the order in which items enter an evaluation pipeline skews results in a predictable direction. That predictability makes the bias exploitable, not just a source of unreliability.
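One way to check a pipeline for this kind of order dependence is to score the same batch under several random presentation orders and look at how much each item's score moves. The sketch below is an assumption-laden illustration, not a method from the paper: `BatchJudge`, `order_sensitivity`, and `make_toy_judge` are hypothetical names, and the toy judge simply pulls each score toward the mean of the scores it has already emitted in the session.

```python
import random
from statistics import pstdev
from typing import Callable, Dict, List, Sequence

# Hypothetical batch judge (an assumption): scores items in the order given,
# within a single session, and returns one score per item.
BatchJudge = Callable[[Sequence[str]], List[float]]


def order_sensitivity(judge_batch: BatchJudge, items: Sequence[str],
                      n_orders: int = 20, seed: int = 0) -> Dict[str, float]:
    """Per-item standard deviation of scores across random presentation orders.
    Items are assumed to be unique strings."""
    rng = random.Random(seed)
    scores: Dict[str, List[float]] = {item: [] for item in items}
    for _ in range(n_orders):
        order = list(items)
        rng.shuffle(order)
        for item, score in zip(order, judge_batch(order)):
            scores[item].append(score)
    return {item: pstdev(vals) for item, vals in scores.items()}


def make_toy_judge(base: Dict[str, float], pull: float) -> BatchJudge:
    """Toy stand-in for a session-based LLM judge: each score is pulled
    toward the mean of the scores emitted earlier in the same session."""
    def judge(order: Sequence[str]) -> List[float]:
        out: List[float] = []
        for item in order:
            prior = sum(out) / len(out) if out else base[item]
            out.append((1 - pull) * base[item] + pull * prior)
        return out
    return judge


if __name__ == "__main__":
    base = {"clearly good": 0.9, "ambiguous": 0.5, "clearly bad": 0.1}
    for pull in (0.0, 0.3):
        spread = order_sensitivity(make_toy_judge(base, pull), list(base))
        print(pull, {k: round(v, 3) for k, v in spread.items()})
```

An order-independent judge yields zero spread for every item; any nonzero spread means an item's score depends on what was scored before it.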

The vulnerability connects directly to the post-training dynamics covered in 'Post-Training is About States, Not Tokens', published the same day. That paper argues that the distribution of states a model is exposed to during training shapes its behavior as consequentially as the loss objective itself. AMEL's findings suggest an analogous effect persists at inference: the distribution of prior messages in a session acts as an implicit conditioning signal, bending outputs toward whatever polarity the context has established. The mechanistic interpretability work on GPT-2 activations ('Reading Task Failure Off the Activations') adds a complementary lens, showing that specific learned features can dominate model outputs in ways that aggregate benchmarks miss entirely. Together, these papers sketch a picture in which evaluation pipelines face compounding blind spots.

Watch whether any of the four providers covered by the study (OpenAI, Anthropic, Google, and a fourth not named in the summary) acknowledges the finding and publishes mitigation guidance within the next two quarters. Silence would mean production evaluation tooling remains exposed with no official acknowledgment of the risk.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: OpenAI · Anthropic · Google · AMEL

Read the full story at arXiv cs.LG (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

