Research Models & Releases·arXiv cs.LG·May 25

QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

Researchers introduce QUIET, a benchmark designed to measure generative rather than discriminative creative ability in large language models. Unlike existing story-completion tests that rely on multiple-choice recognition or subjective rubric scoring, QUIET uses cascaded multi-blank story cloze tasks with explicit content constraints to enable automated, objective evaluation of LLM narrative generation. This addresses a critical gap in LLM evaluation: most benchmarks test whether models can recognize good continuations, not whether they can produce them. The work matters because it could reshape how the field validates creative capabilities, moving beyond proxy metrics toward direct measurement of generation quality.

Modelwire context

Explainer

QUIET's actual innovation is less the benchmark itself than the cascaded multi-blank constraint mechanism, which makes otherwise subjective creative evaluation automatable. Most prior work uses either multiple-choice formats (which don't test generation) or open rubrics (which require human judgment). This paper targets a specific technical problem: how to measure whether a model can produce coherent narrative without hiring annotators.
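The summary does not describe QUIET's item format or scoring rule, so the Python sketch below is only a toy illustration of the general idea: blanks are filled in order, each fill is fed into the context for the next blank, and each fill is checked against explicit constraints without a human judge. The `Blank` fields (required words, a word limit), the `run_cascade` function, and the stand-in generator are all hypothetical names chosen for this example, not the paper's method.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Blank:
    # Hypothetical content constraints for one blank (assumed, not from the paper).
    must_include: tuple[str, ...]  # words the fill has to contain
    max_words: int                 # upper bound on fill length


def satisfies(fill: str, blank: Blank) -> bool:
    """Check a generated fill against its constraints with no human in the loop."""
    words = re.findall(r"[a-z']+", fill.lower())
    has_required = all(w.lower() in words for w in blank.must_include)
    return has_required and len(words) <= blank.max_words


def run_cascade(
    segments: Sequence[str],
    blanks: Sequence[Blank],
    generate: Callable[[str, Blank], str],
) -> tuple[float, str]:
    """Fill blanks left to right; each fill becomes context for later blanks.

    `segments` holds the fixed story text around the blanks, so it needs
    exactly one more entry than `blanks`.
    """
    if len(segments) != len(blanks) + 1:
        raise ValueError("need len(blanks) + 1 fixed segments")
    context = segments[0]
    passed = 0
    for blank, following in zip(blanks, segments[1:]):
        fill = generate(context, blank)   # the model sees everything so far
        passed += satisfies(fill, blank)  # True counts as 1
        context = f"{context} {fill} {following}".strip()
    return passed / len(blanks), context


def stub_generator(context: str, blank: Blank) -> str:
    """Stand-in for an LLM call: echoes the required words in one sentence."""
    return "Then " + " and ".join(blank.must_include) + " appeared."


segments = ["Mara reached the pier at dusk.", "She waited.", "Nobody spoke."]
blanks = [Blank(("lantern",), 12), Blank(("storm", "boat"), 12)]
score, story = run_cascade(segments, blanks, stub_generator)
```

Swapping `stub_generator` for a real model call would leave the scoring loop unchanged, which is the property that makes this style of evaluation automatable.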

This connects directly to the Creative Quality Alignment work from the same day, which tackled a parallel problem: how to measure creative quality at scale with minimal annotation. That paper used chain-of-thought fine-tuning on 100 expert examples to surface structural gaps in alignment datasets. QUIET comes at the problem from the opposite angle, asking how to build a benchmark that requires no expert annotation at all. Together they suggest the field is converging on the recognition that existing creative evaluation methods (subjective rubrics, proxy metrics) don't scale. The Deployment-complete Benchmarking paper is also relevant here: QUIET's automated scoring matters only if it actually predicts whether models produce usable narrative in practice, not just whether they pass the test.

If QUIET scores correlate with human preference judgments on held-out stories at r > 0.75 across multiple model families, that would indicate the cascaded constraint approach is viable. If the correlation drops below 0.6 on out-of-distribution narrative domains (e.g., dialogue-heavy vs. description-heavy stories), that would suggest the benchmark is overfitting to its own constraint structure and won't transfer to real creative work.
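The check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that "r" means the Pearson correlation coefficient (the text does not say which coefficient is intended) and that benchmark scores and human preference ratings are available as paired lists per model family and per out-of-distribution domain. The function names, the dictionary layout, and the sample numbers are hypothetical; only the 0.75 and 0.6 thresholds come from the text.

```python
from math import sqrt

VIABLE_ABOVE = 0.75   # in-domain threshold quoted in the text
OVERFIT_BELOW = 0.60  # out-of-distribution threshold quoted in the text


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def assess(in_domain, out_of_domain):
    """Apply both thresholds.

    Each argument maps a label (model family or narrative domain) to a
    pair (benchmark_scores, human_ratings) over the same stories.
    """
    in_r = {k: pearson_r(b, h) for k, (b, h) in in_domain.items()}
    ood_r = {k: pearson_r(b, h) for k, (b, h) in out_of_domain.items()}
    return {
        "in_domain_r": in_r,
        "ood_r": ood_r,
        # Viable only if every model family clears the bar.
        "viable": bool(in_r) and all(r > VIABLE_ABOVE for r in in_r.values()),
        # Any single domain below the floor raises the overfitting flag.
        "overfit_warning": any(r < OVERFIT_BELOW for r in ood_r.values()),
    }


# Made-up numbers purely to exercise the functions; these are not results.
report = assess(
    in_domain={"family_a": ([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.2])},
    out_of_domain={"dialogue_heavy": ([1, 2, 3, 4], [3, 1, 4, 2])},
)
```

Values between 0.6 and 0.75 trigger neither flag, which matches the text: it names a pass condition and a failure condition but leaves the middle range undecided.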

Coverage we drew on

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: QUIET · Story Cloze Test · HellaSwag

Read the full story at arXiv cs.LG (arxiv.org).
