Research Models & Releases · arXiv cs.LG · Apr 20

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Researchers benchmarked cloud and open-source LLMs on system dynamics tasks, finding cloud models hit 77-89% accuracy on causal diagram extraction while the best local model (Kimi K2.5) matched mid-tier cloud performance. Local models struggled with error-fixing in interactive coaching scenarios, revealing a gap in long-context reasoning.

Modelwire context

Explainer

The benchmark targets a genuinely narrow professional use case: helping practitioners build and critique causal loop diagrams, a core tool in policy modeling and organizational analysis. The finding that local models break down specifically during iterative error-correction, rather than initial extraction, points to a failure mode in multi-turn reasoning under constraint, not just raw accuracy.
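For readers unfamiliar with the format, a causal loop diagram can be treated as a signed directed graph, and a closed loop is conventionally called reinforcing when the product of its link polarities is positive and balancing otherwise. The sketch below illustrates only that convention; the variable names and the toy diagram are hypothetical and are not taken from the benchmark, and this is not the paper's scoring method.

```python
from math import prod

# Hypothetical toy diagram (not from the benchmark). Each causal link is
# (cause, effect, polarity): +1 means the effect moves in the same direction
# as the cause, -1 means it moves in the opposite direction.
toy_loop = [
    ("workload", "stress", +1),
    ("stress", "productivity", -1),
    ("productivity", "workload", -1),
]


def loop_polarity(loop_links):
    """Classify a closed loop by the product of its link polarities."""
    sign = prod(polarity for _, _, polarity in loop_links)
    return "reinforcing" if sign > 0 else "balancing"


print(loop_polarity(toy_loop))
```

Two negative links cancel out, so this toy loop is classified as reinforcing. Extracting a diagram from text amounts to recovering a list of links like `toy_loop`, which is the kind of structure an assistant would then have to revise during error-correction.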

This connects directly to the reliability problems surfaced in 'Diagnosing LLM Judge Reliability' (arXiv, April 16), which found that aggregate consistency scores mask per-instance logical breakdowns in roughly one-third to two-thirds of cases. The coaching scenario failures described here look like the same underlying issue: models that appear coherent at the task level but lose track of constraints across turns. The 'Generalization in LLM Problem Solving' paper from the same week adds another angle, showing that LLMs fail when problems require recursive depth, which is structurally similar to iterative diagram correction. Together, these papers suggest the cloud-versus-local gap is less about raw capability and more about sustained structured reasoning over longer interaction chains.

If Kimi K2.5 or a comparable open-weight model closes the error-correction gap on a follow-up interactive coaching benchmark within the next two quarters, that would confirm the bottleneck is addressable through fine-tuning on domain-specific feedback loops rather than a fundamental context-length ceiling.

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Kimi K2.5 · CLD Leaderboard · Discussion Leaderboard

Read the full story at arXiv cs.LG (arxiv.org).


