When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation


Researchers measured how ten LLM architectures respond to semantic versus surface-level noise across three major benchmarks. They found that semantic perturbations (paraphrasing, synonym substitution) shift model outputs 19.7 percentage points more often than formatting changes of equivalent severity. This systematic robustness gap, validated statistically across 1,530 test cases and 11,150 variants, points to a vulnerability in chain-of-thought and ReAct agents: stability under cosmetic changes to the input can be mistaken for consistent reasoning, yet the same agents change their answers when the wording shifts. The finding matters for practitioners deploying agents in production, because it suggests current systems lack robust semantic grounding even when they appear stable under cosmetic input variations.
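To make the headline figure concrete, here is a minimal sketch of how a percentage-point gap between two output-change rates can be computed. The `flip_rate` and `gap_in_points` helpers and all toy answers are hypothetical illustrations, not the study's code or data, and the sketch assumes the metric is the share of cases whose final answer changes after a perturbation.

```python
from typing import Sequence


def flip_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of cases whose answer changes after a perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("answer lists must be the same length")
    if not original:
        raise ValueError("need at least one case")
    flips = sum(a != b for a, b in zip(original, perturbed))
    return flips / len(original)


def gap_in_points(semantic_rate: float, surface_rate: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return (semantic_rate - surface_rate) * 100.0


# Toy answers (hypothetical, not taken from the study).
baseline = ["42", "7", "15", "3", "8"]
after_surface = ["42", "7", "15", "3", "9"]    # 1 of 5 answers changed
after_semantic = ["42", "6", "14", "3", "9"]   # 3 of 5 answers changed

surface = flip_rate(baseline, after_surface)
semantic = flip_rate(baseline, after_semantic)
gap = gap_in_points(semantic, surface)

print(f"surface: {surface:.0%}, semantic: {semantic:.0%}, gap: {gap:.1f} points")
```

A gap in percentage points is a difference between two rates, not a ratio: a move from 20% to 60% is a 40-point gap, even though the second rate is three times the first.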

Modelwire context

Explainer

The study's most underreported implication is directional: agents appear robust because surface-level stability is easy to achieve, and that appearance actively masks a deeper failure to anchor reasoning in meaning. The 19.7 percentage point gap is not just a performance number but a diagnostic of where current training signals are pointing models.

This finding lands directly on top of the benchmark integrity problem covered in 'Automated Benchmark Auditing for AI Agents and Large Language Models' from the same week. That piece showed that over a quarter of frontier benchmarks contain structural defects, meaning published robustness scores may already be measuring brittle proxies. This study adds a second layer: even on well-formed benchmarks, the metric being optimized (output stability) may not reflect the property practitioners actually care about (semantic grounding). Together, the two papers describe a compounding measurement problem, where both the benchmarks and the robustness criteria applied to them can mislead simultaneously.

Watch whether the Qwen team or the developers of any of the other nine tested architectures release targeted fine-tuning runs that close the semantic robustness gap on GSM8K or MATH without degrading surface stability scores. If the gap narrows on one axis but widens on the other, that would indicate the two properties are in tension and not jointly optimizable with current training recipes.
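The check described above can be sketched as a comparison of two instability rates before and after a fine-tuning run. The `compare_runs` function, its labels, and the `tol` noise threshold are all hypothetical, and the inputs are assumed to be output-change rates where lower means more robust. A single comparison like this is only suggestive, not a statistical test.

```python
def compare_runs(
    sem_before: float,
    sem_after: float,
    surf_before: float,
    surf_after: float,
    tol: float = 0.01,
) -> str:
    """Label how two instability rates moved between two runs.

    All inputs are output-change rates in [0, 1]; lower is more robust.
    Changes smaller than `tol` are treated as noise.
    """
    sem_delta = sem_after - sem_before
    surf_delta = surf_after - surf_before

    sem_better, sem_worse = sem_delta < -tol, sem_delta > tol
    surf_better, surf_worse = surf_delta < -tol, surf_delta > tol

    # One axis improves while the other degrades: the pattern to watch for.
    if (sem_better and surf_worse) or (sem_worse and surf_better):
        return "trade-off"
    if sem_better or surf_better:
        return "improvement without regression"
    if sem_worse or surf_worse:
        return "regression"
    return "no clear change"


# Hypothetical rates, not taken from the study.
print(compare_runs(0.30, 0.15, 0.10, 0.18))  # semantic improves, surface degrades
print(compare_runs(0.30, 0.15, 0.10, 0.10))  # semantic improves, surface holds
```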

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GSM8K · MATH · HotpotQA · Qwen · chain-of-thought agents · ReAct agents

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

