Research Tools & Code · arXiv cs.CL · 1d ago

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Researchers have released ALMANAC, a dataset designed to address a critical gap in agent collaboration: teaching LLM agents to maintain and align mental models during multi-party work. Current agents optimize for task completion but lack the process-level reasoning needed for genuine partnership. The dataset annotates human collaboration at the action level, capturing how participants track each other's intentions and shared objectives. The work signals growing recognition that scaling agent capability requires moving beyond task metrics toward collaborative competence, a shift that could reshape how teams evaluate and train multi-agent systems.

Modelwire context

Explainer

ALMANAC is not just another collaboration dataset; it's annotated at the action level to capture the reasoning process behind coordination choices, not just task outcomes. This distinction matters because it enables training agents to recognize when teammates' mental models diverge and recover from misalignment mid-task, rather than only optimizing for successful completion.
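To make the idea of action-level annotation concrete, here is a minimal sketch in Python. It is illustrative only: the `AnnotatedAction` record, its `believed_goal` field, the `first_divergence` helper, and the example trace are all hypothetical and are not taken from ALMANAC's actual schema or data. The sketch assumes each action carries a label for what the actor believes the shared objective is, and it reports the first step at which participants' beliefs stop matching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedAction:
    """One action in a collaboration trace (hypothetical schema, not ALMANAC's)."""
    step: int
    actor: str
    action: str
    believed_goal: str  # what the actor believes the shared objective is at this step


def first_divergence(trace: list[AnnotatedAction]) -> Optional[int]:
    """Return the first step where participants' latest believed goals differ, else None."""
    latest: dict[str, str] = {}  # actor -> most recently annotated believed goal
    for item in sorted(trace, key=lambda a: a.step):
        latest[item.actor] = item.believed_goal
        if len(set(latest.values())) > 1:
            return item.step
    return None


# Invented example trace: bob silently switches goals at step 3.
trace = [
    AnnotatedAction(1, "alice", "propose outline", "draft report"),
    AnnotatedAction(2, "bob", "agree to outline", "draft report"),
    AnnotatedAction(3, "bob", "start building slides", "prepare slides"),
    AnnotatedAction(4, "alice", "ask bob about slides", "draft report"),
]
divergence_step = first_divergence(trace)
print(divergence_step)
```

An outcome-only label for this trace would record just whether the task succeeded. A per-action label exposes the step where the two participants stopped sharing a goal, which is the kind of signal an agent would need in order to notice and repair misalignment mid-task.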

This work directly operationalizes the evaluation gap that CollabSim identified just days earlier. Where CollabSim proposed methodology to measure collaborative competence, ALMANAC provides the labeled training data that agents need to actually develop it. The two papers together form a feedback loop: CollabSim defines what collaborative competence looks like; ALMANAC gives agents the grounded examples to learn from. This also connects to the broader pattern in recent agent research (AgentCL, COMAP) where the field is shifting from single-task optimization toward systems that adapt, learn, and coordinate across multiple interaction modes.

If teams fine-tune LLM agents on ALMANAC and those agents outperform baselines on CollabSim's evaluation metrics (particularly on shared understanding and misalignment recovery), that would be evidence that the dataset captures genuine collaborative reasoning. If the gains do not transfer to CollabSim tasks, the annotations may be too task-specific to generalize.

Coverage we drew on

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: ALMANAC · LLM agents
