Research Tools & Code · arXiv cs.CL · 1d ago

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

CoEval addresses a critical pain point in model selection: benchmark contamination has made public leaderboards unreliable proxies for real-world performance. The framework generates task-specific evaluation sets on the fly from task descriptions alone, then uses an ensemble judge to rank models without human annotation. The approach sidesteps both data scarcity and the memorization problem that has hollowed out standard benchmarks, and it reports a 0.86 correlation with ground truth on the tasks where validation is possible. For practitioners choosing models for niche domains, this shifts evaluation from trusting the leaderboard to reproducible, contamination-free ranking.
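The ranking step can be pictured with a minimal sketch. It is not CoEval's implementation: the summary does not say how the ensemble judge combines its votes, so the averaging scheme, the function name `rank_models`, and all model names, judge names, and scores below are hypothetical.

```python
from statistics import mean

def rank_models(judge_scores):
    """Rank models by their mean judge score, best first.

    judge_scores maps model -> judge -> list of scores, one per generated
    evaluation item. This layout is an assumption made for illustration.
    """
    aggregated = {}
    for model, by_judge in judge_scores.items():
        # Average within each judge first so that no single judge dominates,
        # then average across judges to get the ensemble score.
        per_judge = [mean(scores) for scores in by_judge.values()]
        aggregated[model] = mean(per_judge)
    return sorted(aggregated.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores on three generated items from two judges.
scores = {
    "model_a": {"judge_1": [4, 5, 4], "judge_2": [3, 4, 4]},
    "model_b": {"judge_1": [2, 3, 3], "judge_2": [3, 3, 2]},
    "model_c": {"judge_1": [5, 4, 5], "judge_2": [5, 5, 4]},
}
ranking = rank_models(scores)
print([name for name, _ in ranking])
```

Averaging per judge before averaging across judges is one simple design choice among many; a real ensemble might weight judges, use pairwise comparisons, or take a majority vote instead.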

Modelwire context

Analyst take

The 0.86 correlation figure is promising but comes with a quiet caveat: it only applies where ground truth validation was possible, meaning the hardest cases (truly novel domains with no labeled data at all) remain unverified. That gap between the headline number and the actual coverage of the claim deserves more scrutiny than the summary gives it.
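To make the caveat concrete, here is a minimal sketch of how such an agreement score might be computed. The summary does not say which correlation measure the paper uses, so Spearman's rank correlation is an assumption here, and the ranks are hypothetical. The point is that the calculation needs a ground-truth ranking as input, so it cannot be run on a task that has no labeled data.

```python
def spearman(rank_x, rank_y):
    """Spearman rank correlation between two rankings of the same items.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is
    valid only when neither ranking contains ties.
    """
    n = len(rank_x)
    d2 = sum((rank_x[m] - rank_y[m]) ** 2 for m in rank_x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ranks (1 = best) on a task where labeled data exists.
judge_rank = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
truth_rank = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
rho = spearman(judge_rank, truth_rank)
print(round(rho, 2))
```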

This lands directly on top of the Amazon internal leaderboard story from June 1st, where employee gaming forced a shutdown and exposed how fragile competitive evaluation structures are under real organizational pressure. CoEval's contamination-free framing is essentially a technical answer to the same institutional problem Amazon hit socially. It also connects to K-BrowseComp's finding that public benchmark scores mask serious performance gaps in deployment contexts, reinforcing why task-specific, on-the-fly evaluation has practical appeal beyond academic novelty.

Watch whether any major model provider or evaluation platform (Hugging Face, LMSYS) formally integrates CoEval-style synthetic task generation into their ranking pipelines within the next six months. Adoption at that level would confirm the approach is trusted beyond the paper's own validation set.

Coverage we drew on

Amazon Shuts Down Internal AI Leaderboard After Employees Cheated · 404 Media

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.


Read the full story at arXiv cs.CL (arxiv.org)
