The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

Researchers tested whether prompt engineering or model selection does more to improve LLM accuracy when predicting fan experience ratings from baseball survey text. Prompt changes raised accuracy by only 2 percentage points (67% to 69%), and both GPT-5.2 and GPT-4.1-mini underperformed the baseline, suggesting diminishing returns from either kind of optimization.

Modelwire context

Explainer

The paper's deeper implication is that the bottleneck may not be the model or the prompt at all, but the inherent ambiguity in the source text itself. When open-ended survey responses are vague or emotionally mixed, no amount of model tuning can recover signal that was never there to begin with.
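One way to make the ceiling idea concrete is the standard Bayes-optimal accuracy bound: if the same text is compatible with several ratings, even a perfect predictor can only pick the most likely one. The sketch below is illustrative only. The function name `accuracy_ceiling` and the toy distributions are hypothetical and are not taken from the paper.

```python
def accuracy_ceiling(conditionals):
    """Upper bound on expected accuracy for any predictor.

    `conditionals` is a list with one dict per survey response, mapping
    each candidate rating to P(rating | text). The best any model can do
    on a response is to pick its most probable rating, so the ceiling is
    the mean of the per-response maximum probabilities.
    """
    if not conditionals:
        raise ValueError("need at least one response")
    return sum(max(dist.values()) for dist in conditionals) / len(conditionals)


# Hypothetical toy data: one unambiguous response, one evenly split
# between two ratings. These numbers are assumptions for illustration.
toy = [
    {5: 1.0},            # clearly a 5
    {3: 0.5, 4: 0.5},    # vague text: a 3 or a 4 is equally plausible
]
ceiling = accuracy_ceiling(toy)
```

Under these assumed distributions the ceiling sits below 100% regardless of which model or prompt is used, which is the sense in which the signal, not the predictor, sets the limit.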

This connects directly to the reliability problems surfaced in our April 16 coverage of 'Diagnosing LLM Judge Reliability,' where researchers found that even when aggregate accuracy looks acceptable, per-instance consistency breaks down badly. That paper was about LLMs judging other LLMs, but the structural problem is the same: aggregate metrics flatter performance while hiding failure at the document level. Taken together, the two papers suggest that the field is converging on an uncomfortable finding: LLM-based text evaluation has a hard ceiling shaped by input quality, not just model capability.
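The gap between aggregate accuracy and per-instance consistency can be shown with a small sketch. It assumes the same documents are rated in several independent runs. The function names and the toy ratings are hypothetical and do not reproduce either paper's data or metrics.

```python
def aggregate_accuracy(runs, gold):
    """Share of correct predictions pooled over all runs and documents."""
    total = sum(len(run) for run in runs)
    correct = sum(p == g for run in runs for p, g in zip(run, gold))
    return correct / total


def per_instance_consistency(runs):
    """Share of documents on which every run returns the same rating."""
    per_doc = list(zip(*runs))  # regroup predictions by document
    return sum(len(set(preds)) == 1 for preds in per_doc) / len(per_doc)


# Hypothetical toy data: four documents, three repeated runs.
gold = [5, 4, 3, 2]
runs = [
    [5, 4, 3, 1],
    [5, 4, 2, 2],
    [5, 3, 3, 2],
]
acc = aggregate_accuracy(runs, gold)
consistency = per_instance_consistency(runs)
```

In this toy case each run gets three of four documents right, yet the runs agree with one another on only one document, so the headline accuracy looks far healthier than the document-level behavior.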

If follow-on work by Hong, Potteiger, or Zapata tests the same methodology on structured or semi-structured survey formats (rating scales paired with open text) and accuracy climbs meaningfully above 70%, that would isolate input ambiguity as the primary constraint rather than model architecture.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT-4.1 · GPT-4.1-mini · GPT-5.2 · Hong · Potteiger · Zapata

Read full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

