
When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation
Researchers measured how ten LLM architectures respond to semantic versus surface-level noise across three major benchmarks. Meaning-altering perturbations (paraphrasing, synonym substitution) shifted model outputs 19.7 percentage points more often than formatting changes of equivalent severity. This robustness gap, validated statistically across 1,530 test cases and 11,150 variants, points to a weakness in chain-of-thought and ReAct agents: stability under cosmetic input changes can be mistaken for consistent reasoning. For practitioners deploying agents in production, the result suggests that a system can appear stable under formatting variation while still lacking robust semantic grounding.
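To make the headline number concrete, here is a minimal sketch of how a percentage-point gap between two output-shift rates could be computed. It is not the study's actual pipeline: the Variant record, the "semantic" and "surface" labels, the exact-match definition of a shift, and the toy data are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One perturbed input and the outputs it produced (hypothetical schema)."""
    kind: str            # assumed labels: "semantic" or "surface"
    base_output: str     # model output on the unperturbed test case
    variant_output: str  # model output on the perturbed variant


def shift_rate(variants, kind):
    """Fraction of variants of the given kind whose output differs from the base."""
    relevant = [v for v in variants if v.kind == kind]
    if not relevant:
        raise ValueError(f"no variants of kind {kind!r}")
    # Assumption: an output counts as shifted when it is not an exact string match.
    shifted = sum(v.base_output != v.variant_output for v in relevant)
    return shifted / len(relevant)


def robustness_gap_pp(variants):
    """Semantic shift rate minus surface shift rate, in percentage points."""
    return 100.0 * (shift_rate(variants, "semantic") - shift_rate(variants, "surface"))


# Toy data, invented for this example: 2 of 4 semantic variants shift,
# 1 of 4 surface variants shifts.
TOY_VARIANTS = [
    Variant("semantic", "A", "B"),
    Variant("semantic", "A", "A"),
    Variant("semantic", "C", "D"),
    Variant("semantic", "C", "C"),
    Variant("surface", "A", "A"),
    Variant("surface", "A", "B"),
    Variant("surface", "C", "C"),
    Variant("surface", "C", "C"),
]

if __name__ == "__main__":
    print(f"gap: {robustness_gap_pp(TOY_VARIANTS):.1f} pp")
```

On the toy data the semantic shift rate is 50% and the surface shift rate is 25%, so the gap is 25 percentage points. A difference in percentage points is an absolute difference between two rates, not a relative increase; the reported 19.7-point gap should be read the same way.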
