Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Researchers have identified a predictable scaling relationship governing how well language models recall factual information: a sigmoid function links recall performance to both model size and the frequency of a topic in the training data. The finding, validated across 38 models and 8,900 scholarly references, explains 60-94% of the variance in recall quality. It suggests that factual accuracy is gated by a signal-to-noise ratio in which concept prevalence acts as the signal and model capacity sets the noise floor. The resulting scaling law gives practitioners a framework for predicting hallucination risk and informs model selection and training-data curation for knowledge-intensive applications.
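To make the shape of such a relationship concrete, here is a minimal sketch of a sigmoid recall model. The summary above does not give the paper's actual parametrization, so the functional form (a sigmoid over a weighted sum of log model size and log topic frequency), the function name `predicted_recall`, and the coefficients `a`, `b`, and `c` are all illustrative assumptions, not fitted values from the study.

```python
import math


def predicted_recall(n_params, topic_freq, a=1.0, b=1.0, c=-12.0):
    """Hypothetical sigmoid recall model (illustrative only).

    n_params   -- model size in parameters (must be > 0)
    topic_freq -- how often the topic appears in training data (must be > 0)
    a, b, c    -- made-up coefficients; the paper's fitted values are not
                  given in the summary this sketch is based on
    """
    if n_params <= 0 or topic_freq <= 0:
        raise ValueError("n_params and topic_freq must be positive")
    # Treat log frequency as the "signal" term and log model size as
    # lowering the "noise floor"; both raise the log signal-to-noise score.
    z = a * math.log10(n_params) + b * math.log10(topic_freq) + c
    return 1.0 / (1.0 + math.exp(-z))


# Under these assumed coefficients, a rare topic stays hard to recall
# even for a much larger model, while a common topic is recalled well.
small_rare = predicted_recall(1e9, 10)
large_rare = predicted_recall(7e10, 10)
small_common = predicted_recall(1e9, 1e5)
```

With these assumed coefficients, raising topic frequency moves predicted recall as much as raising model size does, which is the qualitative pattern the summary describes. Real coefficients would have to come from fitting observed recall data.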

Modelwire context

Explainer

The more actionable finding is buried, and it is directional: topic frequency in training data matters as much as raw model scale, so adding parameters to a knowledge-sparse domain will not reliably fix hallucination. That reframes data curation as a first-class engineering decision rather than a preprocessing afterthought.

This connects to the attention efficiency work we covered the same day in 'DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention'. DashAttention tries to reduce compute at inference without sacrificing quality, but this factual-recall paper implies that the limits on quality are set earlier, during training-data composition, not at inference time. Together the two findings suggest a division of labor: architectural efficiency work optimizes what a model does with what it knows, while data curation determines the ceiling on what it can know in the first place. That distinction matters for teams deciding where to invest engineering resources.

Watch whether any of the 38 models tested show the sigmoid relationship breaking down at very high topic frequency, which would indicate a saturation regime where additional data stops helping. If that threshold gets quantified in follow-up work, it becomes a concrete budget signal for training data acquisition.
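One way such a saturation threshold could be estimated from recall measurements is sketched below. This is not a method from the paper: the function name `saturation_frequency`, the gain threshold, the `min_recall` cutoff, and the toy numbers are all hypothetical and chosen purely for illustration.

```python
import math


def saturation_frequency(freqs, recalls, min_gain_per_decade=0.02,
                         min_recall=0.5):
    """Return the first frequency past which recall stops improving.

    Scans (frequency, recall) points in order of frequency and returns the
    lower frequency of the first interval whose recall gain per tenfold
    increase in frequency falls below min_gain_per_decade. Intervals that
    start below min_recall are skipped so the flat lower tail of a sigmoid
    is not mistaken for saturation. Returns None if no such interval exists.
    Frequencies must be positive.
    """
    pairs = sorted(zip(freqs, recalls))
    for (f0, r0), (f1, r1) in zip(pairs, pairs[1:]):
        decades = math.log10(f1) - math.log10(f0)
        if decades <= 0 or r0 < min_recall:
            continue
        if (r1 - r0) / decades < min_gain_per_decade:
            return f0
    return None


# Toy, made-up measurements (not data from the study).
toy_freqs = [1e1, 1e2, 1e3, 1e4, 1e5]
toy_recalls = [0.05, 0.30, 0.70, 0.90, 0.905]
threshold = saturation_frequency(toy_freqs, toy_recalls)
```

On the toy numbers, recall barely moves between the last two frequency bins, so the sketch reports the lower of those two frequencies as the point where additional data stops helping.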

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: LLMs · scaling laws · factual recall · training data composition · signal-to-noise ratio

Read full story at arXiv cs.LG (arxiv.org)
