Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task
A new evaluation framework probes how LLMs handle compositional semantics by separating extensional reasoning (what an expression refers to) from intensional reasoning (its structured meaning). Results on the Personal Relation Task show that models can resolve complex nested references such as 'Amber's parent's friend', but that compositional interpretation comes less naturally to them than it does to humans. The finding bears on whether LLMs truly grasp language structure or merely pattern-match, with implications for reliability in tasks that require systematic semantic decomposition.
Modelwire context
Explainer
The paper's real contribution is the evaluation framework that separates extensional from intensional reasoning, not just the finding that LLMs struggle with composition. This distinction matters because it lets researchers pinpoint where models fail: at the reference-resolution step or at the structural interpretation step.
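To make the two steps concrete, here is a minimal toy sketch in Python. It is not the paper's method or dataset: the `WORLD` table, the names in it, and the `parse` and `resolve` helpers are all hypothetical, chosen only to illustrate the summary's example 'Amber's parent's friend'. The sketch assumes each relation maps a person to exactly one other person, and it treats recovering the phrase's structure and looking up its referent as two separate functions.

```python
from typing import Dict, List, Tuple

# Hypothetical toy world (an assumption, not data from the paper):
# each relation maps a person to exactly one other person.
WORLD: Dict[str, Dict[str, str]] = {
    "parent": {"Amber": "Brian", "Brian": "Carla"},
    "friend": {"Amber": "Dana", "Brian": "Elena", "Carla": "Amber"},
}


def parse(phrase: str) -> Tuple[str, List[str]]:
    """Structural step: split a possessive chain into (anchor, relations)."""
    parts = phrase.split("'s ")
    return parts[0], parts[1:]


def resolve(phrase: str, world: Dict[str, Dict[str, str]] = WORLD) -> str:
    """Reference step: apply each relation in order to reach the referent."""
    person, relations = parse(phrase)
    for relation in relations:
        person = world[relation][person]
    return person


if __name__ == "__main__":
    print(parse("Amber's parent's friend"))
    print(resolve("Amber's parent's friend"))
```

In this toy setup, a system could fail in either function independently: it might recover the chain `["parent", "friend"]` correctly yet look up the wrong person, or it might produce the right person without ever building the chain. That is the kind of separation the framework is described as targeting.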
This connects directly to the mechanistic work on compositional arithmetic from the same day (arXiv cs.LG, 2026-05-29), which isolated specific circuit-level pathways that enable compositional generalization in transformers. Both papers are asking the same underlying question: do models actually factor complex reasoning into reusable components, or do they just memorize patterns? The Paperno framework provides a behavioral diagnostic; the arithmetic study provides the mechanistic answer. Together they suggest that when composition works in LLMs, it's not because the model has learned a natural decomposition strategy, but because it's reusing internal modules in ways that happen to produce correct outputs.
If Paperno's team applies the extensional/intensional split to the same models tested in the arithmetic paper (small transformers trained on modular tasks), and finds that models with isolated compositional circuits show stronger intensional reasoning than those without, that would confirm composition is mechanistic rather than emergent. If the split shows no correlation, the framework is useful for diagnosis but doesn't explain why composition works when it does.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Large Language Models · Personal Relation Task · Paperno
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.