Research Tools & Code · arXiv cs.LG · May 15

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

FORGE introduces a population-based protocol that improves LLM agent reasoning by evolving natural-language memory artifacts without gradient updates. A reflection agent converts failed trajectories into reusable heuristics and demonstrations, and the top-performing memory is then propagated across the population between training stages. Because nothing is distilled or fine-tuned, the approach suggests a scalable way for agents to bootstrap their own knowledge. It challenges the assumption that agents must be retrained to improve, and could change how teams build reasoning systems that get better through self-reflection.

Modelwire context

Explainer

The key distinction FORGE draws is not just that it avoids fine-tuning, but that it treats memory as a broadcast artifact: a reflection agent distills failure into reusable heuristics, and those heuristics propagate laterally across a population of agents rather than being locked to the individual that generated them. That population dynamic is what separates this from prior self-reflection work like Reflexion, where learned corrections stay local.
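The broadcast dynamic described above can be illustrated with a small sketch. This is not FORGE's implementation: the `Agent` class, `run_episode`, `reflect`, `broadcast`, and `train` are hypothetical names, the rollout and reflection steps are stubs standing in for LLM calls, and the selection rule (copy the single best agent's memory to everyone) is an assumption chosen for simplicity. It only shows the shape of the loop: agents accumulate text memory from failures, and memory moves laterally between stages while no weights change.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One population member; its only mutable state is text memory."""
    name: str
    memory: list[str] = field(default_factory=list)
    score: float = 0.0


def run_episode(agent: Agent, rng: random.Random) -> tuple[float, str]:
    """Hypothetical stand-in for an LLM rollout: returns (reward, trajectory)."""
    # Assumption for illustration only: more memory entries help, up to a cap.
    reward = rng.random() + 0.1 * min(len(agent.memory), 5)
    return reward, f"{agent.name} trajectory with reward {reward:.2f}"


def reflect(trajectory: str) -> str:
    """Hypothetical stand-in for the reflection agent (an LLM call in practice)."""
    return f"heuristic derived from: {trajectory}"


def broadcast(population: list[Agent]) -> Agent:
    """Copy the top scorer's memory to every member of the population."""
    best = max(population, key=lambda a: a.score)
    for agent in population:
        agent.memory = list(best.memory)  # independent copy per agent
    return best


def train(num_agents=4, stages=3, episodes=5, fail_below=0.8, seed=0):
    rng = random.Random(seed)
    population = [Agent(f"agent-{i}") for i in range(num_agents)]
    for _ in range(stages):
        for agent in population:
            rewards = []
            for _ in range(episodes):
                reward, trajectory = run_episode(agent, rng)
                rewards.append(reward)
                if reward < fail_below:  # treat a low reward as a failure
                    agent.memory.append(reflect(trajectory))
            agent.score = sum(rewards) / len(rewards)
        broadcast(population)  # lateral propagation between stages
    return population
```

Dropping the `broadcast` call would leave each agent with only its own reflections, which is closer to the local setup the comparison with Reflexion describes.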

The layer redundancy paper covered the same day ('Layer Equivalence Is Not a Property of Layers Alone') is a useful counterpoint here: that work shows how compression decisions based on weight-level analysis can be systematically wrong, which makes weight-free improvement paths like FORGE more attractive to teams already skeptical of distillation pipelines. More broadly, FORGE belongs to a cluster of research asking whether behavioral improvement can be decoupled from parameter updates entirely. That question has little connection to the archive's watermarking or utility billing coverage, which addresses different deployment concerns.

The benchmark used, CybORG CAGE-2, is a narrow adversarial network defense task. If FORGE's memory propagation gains replicate on open-ended reasoning benchmarks like GPQA or AgentBench within the next two quarters, the weight-free learning claim earns broader credibility. If results stay confined to structured game environments, the scope of the contribution narrows considerably.

Coverage we drew on

Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: FORGE · Reflexion · ReAct · CybORG CAGE-2 · LLM agents

