Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Researchers have identified a critical gap between LLM vocabulary knowledge and actual generation diversity, pinpointing decoding mechanics as the culprit. The Word Coverage Score metric reveals how standard sampling filters like Top-p and Top-k mathematically eliminate contextually valid low-frequency words from the output distribution before they can be sampled. This work reframes repetitiveness as a tractable inference-time issue rather than a product of training data or model architecture, suggesting practitioners can recover linguistic variety by tuning sampling parameters rather than retraining. For practitioners optimizing for naturalness and for researchers studying why models underutilize their learned vocabularies, this offers both diagnostic clarity and a path toward immediate improvement.

Modelwire context

Explainer

The paper's sharpest contribution isn't the metric itself but the causal argument: Top-p and Top-k don't just filter noise, they systematically truncate probability mass from low-frequency but contextually valid tokens as a mathematical consequence of how thresholds interact with skewed distributions. That means the problem is reproducible and predictable, not random.
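The truncation effect can be illustrated with a toy example. The sketch below is not the paper's method or its WCS computation; it assumes a Zipf-shaped next-token distribution over a hypothetical 1,000-token vocabulary (the function names, vocabulary size, and cutoffs are all illustrative choices) and simply counts which tokens survive each filter. The toy has no notion of contextual validity, so it only shows that every dropped token started with nonzero probability.

```python
# Toy illustration (assumed setup, not taken from the paper):
# how Top-k and Top-p cut the tail of a skewed distribution.

def zipf_distribution(vocab_size, exponent=1.0):
    """Return probabilities proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def top_k_survivors(probs, k):
    """Indices kept by Top-k: the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_survivors(probs, p):
    """Indices kept by Top-p: the smallest high-probability prefix
    whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

probs = zipf_distribution(1000)      # hypothetical 1,000-token vocabulary
kept_k = top_k_survivors(probs, 50)  # illustrative k
kept_p = top_p_survivors(probs, 0.9) # illustrative p

dropped_by_k = len(probs) - len(kept_k)
dropped_by_p = len(probs) - len(kept_p)
print(f"Top-k keeps {len(kept_k)} tokens, drops {dropped_by_k}")
print(f"Top-p keeps {len(kept_p)} tokens, drops {dropped_by_p}")
```

Because the cutoff depends only on rank or cumulative mass, the same tail tokens are removed every time the distribution has this shape, which is the sense in which the loss is predictable rather than random.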

This connects directly to the cultural awareness paper covered the same day ('Probing Cultural Awareness in LLMs'), which found that models fail to produce culturally resonant phrasing even when they demonstrably know the relevant vocabulary. WCS offers a plausible mechanical explanation for part of that gap: the words may exist in the model's learned weights but get filtered out at sampling time. More broadly, the inference-time framing echoes the RAG reading work ('Separating Semantic Competition from Context Length'), which similarly argued that a persistent failure mode can be addressed by a tractable post-training intervention rather than reflecting fundamental model limitations.

The practical test is whether adjusting Min-p thresholds (the paper's preferred alternative) produces measurable WCS improvements on culturally specific or domain-narrow corpora without increasing factual error rates. If a follow-up benchmark shows that gain, the inference-time framing holds; if error rates climb proportionally, practitioners face a real diversity-accuracy tradeoff that the paper currently underweights.
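For readers unfamiliar with Min-p, here is a minimal sketch assuming the commonly described formulation, in which a token is kept when its probability is at least `min_p` times the probability of the most likely token. The article does not spell out the paper's exact configuration, so the function name, the 0.1 threshold, and both example distributions are illustrative assumptions.

```python
# Minimal Min-p sketch (assumed formulation; values are illustrative).

def min_p_survivors(probs, min_p):
    """Indices whose probability is at least min_p * max(probs)."""
    threshold = min_p * max(probs)
    return {i for i, pr in enumerate(probs) if pr >= threshold}

# A confident step: one token dominates, so the cutoff is strict.
peaked = [0.90, 0.05, 0.03, 0.01, 0.01]
# An open-ended step: many plausible tokens, so the cutoff relaxes.
flat = [0.22, 0.21, 0.20, 0.19, 0.18]

kept_peaked = min_p_survivors(peaked, 0.1)
kept_flat = min_p_survivors(flat, 0.1)

print("peaked keeps:", sorted(kept_peaked))
print("flat keeps:", sorted(kept_flat))
```

The cutoff scales with the model's confidence at each step, so flatter distributions retain more of their tail. Whether that translates into higher WCS without more factual errors is exactly what the follow-up benchmark would need to show.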

Coverage we drew on

Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Word Coverage Score · Top-p · Top-k · Min-p

Source: arXiv cs.CL (arxiv.org)
