Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Researchers benchmarked the consistency of GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash across repeated generations of exercise prescriptions. GPT-4.1 achieved the highest semantic stability (0.955) but produced an entirely unique output on every run, revealing a tension between reproducibility and diversity that matters for clinical AI deployment.

Modelwire context

Explainer

The study's sharpest finding is not which model scored highest. It is that semantic stability and output reproducibility measure fundamentally different things, and conflating them in clinical settings could produce patient-facing variation that aggregate scores will not catch.
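To make the distinction concrete, here is a minimal sketch of how a set of repeated generations can score high on a similarity measure while containing no two identical outputs. It is not the study's method: the token-overlap (Jaccard) score is a crude lexical stand-in for whatever semantic similarity metric the authors used, and the function names and sample prescriptions are invented for illustration.

```python
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts.

    Assumption: a lexical stand-in for a semantic similarity score,
    not the metric used in the study.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over every pair of repeated generations."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)


def unique_fraction(outputs: list[str]) -> float:
    """Share of generations that are distinct as exact strings."""
    return len(set(outputs)) / len(outputs)


# Hypothetical repeated generations for one prompt (invented data).
runs = [
    "walk briskly for 30 minutes on 5 days each week",
    "walk briskly for 30 minutes on 5 days every week",
    "walk briskly for 30 minutes on 5 days per week",
]

stability = mean_pairwise_similarity(runs)
uniqueness = unique_fraction(runs)

print(f"mean pairwise similarity: {stability:.3f}")
print(f"fraction of unique outputs: {uniqueness:.2f}")
```

In this toy case every output is unique as a string, yet the pairwise similarity stays high. A single aggregate score of that kind would not reveal whether the wording differences happen to touch a clinically meaningful detail, such as a dose or frequency.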

The reliability problem here echoes what we covered in 'Diagnosing LLM Judge Reliability' (arXiv, April 16), where aggregate consistency metrics looked strong at roughly 96% while one-third to two-thirds of individual documents showed logical inconsistencies underneath. The pattern is the same: summary-level numbers flatter, while instance-level behavior is messier. That prior piece was about evaluation pipelines, but the structural warning transfers directly to clinical deployment. The MADE benchmark piece from the same week also flagged uncertainty quantification as a hard requirement for high-stakes healthcare applications, which aligns with what this exercise-prescription study surfaces from a different angle. Neither GPT-Rosalind's launch nor the broader coding-AI competition covered recently has much bearing here.

Watch whether any of the three labs respond to this class of clinical-consistency research by publishing reproducibility controls or temperature-locking guidance specifically for regulated healthcare use cases within the next two quarters. Silence would itself be informative.

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT-4.1 · Claude Sonnet 4.6 · Gemini 2.5 Flash

Read the full story at arXiv cs.CL (arxiv.org)
