GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

Researchers have released GRUFF, a large-scale benchmark for evaluating how well language models handle pronoun resolution in German, a language with complex grammatical gender and agreement rules absent in English. This work exposes a critical gap in LLM evaluation: existing pronoun fidelity tests rely heavily on English's minimal gender marking, leaving model behavior on morphologically richer languages largely unmeasured. The dataset tests four gender agreement systems and pronoun sets, enabling researchers to disentangle whether reasoning failures or gender bias drives pronoun errors. For practitioners deploying multilingual systems, this reveals potential blind spots in model robustness across typologically diverse languages.

Modelwire context

Explainer

What the summary leaves out: GRUFF shows that English-centric pronoun benchmarks have masked systematic failures in morphologically rich languages. Existing evaluations cannot separate reasoning errors from gender bias, because English lacks the grammatical structure needed to test that difference. German has it.
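To make the distinction concrete, here is a minimal sketch of how a single pronoun prediction could be sorted into a reasoning error or a bias-consistent error. It is a hypothetical illustration, not GRUFF's actual scoring method: the `Item` fields, the `classify` function, and the category labels are all assumptions introduced here. The only linguistic facts it relies on are the German nominative third-person singular pronouns er, sie, and es.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One hypothetical test item (field names are assumptions, not GRUFF's schema)."""
    expected: str    # pronoun the context establishes for the target entity
    distractor: str  # pronoun of a second entity mentioned in the context
    stereotype: str  # pronoun stereotypically associated with the target's role


def classify(item: Item, predicted: str) -> str:
    """Label a predicted pronoun as correct or as one of several error types."""
    p = predicted.strip().lower()
    if p == item.expected:
        return "correct"
    matches_distractor = p == item.distractor
    matches_stereotype = p == item.stereotype
    if matches_distractor and matches_stereotype:
        # The item cannot tell the two explanations apart.
        return "ambiguous"
    if matches_distractor:
        # Model picked up the other entity's pronoun: a tracking or reasoning error.
        return "distractor"
    if matches_stereotype:
        # Model fell back on the stereotypical pronoun: a bias-consistent error.
        return "stereotype"
    return "other"


if __name__ == "__main__":
    item = Item(expected="sie", distractor="es", stereotype="er")
    for guess in ("sie", "es", "er", "they"):
        print(guess, "->", classify(item, guess))
```

The sketch only separates the two explanations when the distractor pronoun and the stereotypical pronoun differ. With three grammatical genders, German allows that kind of item to be built more easily than English does, which is the structural advantage the paragraph above describes.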

This connects directly to the entity tracking work from late May ('Do Language Models Track Entities Across State Changes?'), which found that LMs defer computation and aggregate information in parallel rather than incrementally updating state. Pronoun resolution is a form of entity tracking, and GRUFF's finding that models fail on gender agreement suggests the same deferred-computation mechanism may be at play. If models don't track morphological state changes as they process tokens, they'd predictably stumble on languages where gender marking forces that tracking. The broader pattern across these papers is that LLMs have fundamental gaps in maintaining coherent representations across linguistic structure, whether that's entity state, belief updates, or grammatical agreement.

Two things to watch. If GRUFF results correlate with performance drops on other morphologically complex languages (Finnish, Polish, Hungarian) tested with equivalent benchmarks, that would indicate the gap is structural rather than German-specific. If multilingual model developers report within the next six months that GRUFF-informed fine-tuning improves cross-lingual robustness on downstream tasks, the benchmark will have shown real production value.

Coverage we drew on

Do Language Models Track Entities Across State Changes? · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GRUFF · German language models · LLM pronoun fidelity

Read the full story at arXiv cs.CL (arxiv.org).

