Modelwire

The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text


Researchers tested whether prompt engineering or model selection does more to improve LLM accuracy in predicting fan experience ratings from open-ended baseball survey text. Prompt changes raised accuracy by only 2 percentage points (from 67% to 69%), while GPT-5.2 and GPT-4.1-mini both underperformed the baseline, suggesting diminishing returns on optimization.

Mentions: GPT-4.1 · GPT-4.1-mini · GPT-5.2 · Hong · Potteiger · Zapata

Modelwire summarizes; we don’t republish. The full article lives on arxiv.org.
