Research Models & Releases · arXiv cs.CL · May 16

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Researchers have built ConsumerSimBench, an evaluation framework that tests whether LLMs can reproduce real consumer sentiment patterns rather than merely generate plausible-sounding reactions. The benchmark uses 1,553 Chinese social media topics decomposed into 23,122 auditable yes-no criteria, and it achieves 92.1% inter-judge agreement by replacing holistic scoring with granular, verifiable decision points. The work matters because it exposes a gap between LLM fluency and behavioral fidelity, which pushes the field to look beyond open-ended generation metrics when models are used for opinion simulation and market research. The methodology points to a broader shift toward auditable, criterion-level AI evaluation.
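To make the criterion-level idea concrete, here is a minimal sketch of how yes-no verdicts could be tallied. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the function names, the toy verdicts, and the use of simple percent agreement between two judges are all hypothetical, and the summary does not say how the 92.1% figure was computed. Only the topic and criteria counts come from the summary.

```python
def agreement_rate(judge_a, judge_b):
    """Fraction of criteria on which two judges gave the same yes/no verdict.

    Assumption: simple percent agreement; the paper may use another statistic.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must rate the same criteria")
    if not judge_a:
        raise ValueError("at least one criterion is required")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


def criteria_pass_rate(verdicts):
    """Fraction of yes-no criteria that a model response satisfied."""
    if not verdicts:
        raise ValueError("at least one criterion is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical toy verdicts for one topic (True = criterion met).
judge_a = [True, True, False, True, False]
judge_b = [True, False, False, True, False]

print(agreement_rate(judge_a, judge_b))  # 0.8
print(criteria_pass_rate(judge_a))       # 0.6

# Counts from the summary: 23,122 criteria over 1,553 topics.
print(round(23122 / 1553, 1))            # 14.9 criteria per topic on average
```

Because each verdict is a single yes or no, a disagreement between judges can be traced to a specific criterion, which is harder to do with a single holistic score.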

Modelwire context

Explainer

The benchmark is built entirely on Chinese social media data, which means its findings about LLM consumer simulation gaps may reflect culturally specific sentiment patterns rather than universal model limitations. That geographic and linguistic specificity is worth flagging before anyone generalizes the results to Western market research applications.

This is largely disconnected from recent activity in our archive, as we have no prior coverage to anchor it to. It belongs to a growing body of work questioning whether LLMs are reliable proxies for human populations, a concern that spans synthetic data generation, survey simulation, and focus group replacement. The core tension the paper surfaces, that a model can produce fluent, convincing consumer reactions while still failing to reproduce the actual distribution of opinions in a crowd, is the kind of structural limitation that tends to get papered over when vendors pitch LLM-based research tools. ConsumerSimBench's auditable criteria design is a direct response to that problem: it makes failure visible rather than subjective.
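The gap between fluent individual reactions and a faithful crowd-level distribution can be illustrated with a small sketch. This is not the paper's metric (the summary describes yes-no criteria, not a distance measure); total variation distance is swapped in here purely as a familiar way to compare two opinion distributions, and the sentiment labels and counts are invented for illustration.

```python
from collections import Counter


def opinion_distribution(labels):
    """Convert a list of sentiment labels into a label -> share mapping."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Hypothetical data: each simulated reaction may read as plausible on its own,
# yet the overall mix of opinions is the reverse of the real crowd's.
real = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1
simulated = ["positive"] * 6 + ["neutral"] * 3 + ["negative"] * 1

distance = total_variation(opinion_distribution(real), opinion_distribution(simulated))
print(round(distance, 3))  # 0.5
```

A distance of 0 would mean the simulated mix matches the real one exactly; the toy example shows how a set of individually convincing reactions can still land far from the crowd it is meant to represent.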

Watch whether the benchmark gets adopted or replicated using non-Chinese datasets in the next six to twelve months. If it does not, the methodology may remain a proof of concept rather than a field-wide standard for opinion simulation evaluation.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: ConsumerSimBench · LLMs

Read the full story at arXiv cs.CL (arxiv.org)
