Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Tabular foundation models (TFMs) show promise individually but fail to compound gains through ensembling, a critical finding for practitioners betting on TFM adoption. Researchers benchmarked six modern TFMs across 153 classification tasks and found near-perfect correlation between models, creating a diversity ceiling that limits ensemble upside. The best stacking approach improves accuracy by only 0.18% over the strongest single model while consuming 253x more compute. Statistical analysis places three ensemble methods in the same equivalence class as the best base model, suggesting practitioners should question whether ensemble complexity justifies its cost in production tabular ML workflows.
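A standard result for equal-weight averaging illustrates why near-perfect correlation caps the upside: if every model's error has the same variance and every pair shares correlation rho, the averaged error has rho + (1 - rho) / M times the variance of a single model. The sketch below is a minimal illustration on synthetic data, not the benchmark's method. The Pearson-correlation diversity measure, the function names, and the noise level are all assumptions made for the example.

```python
import numpy as np


def mean_pairwise_correlation(preds: np.ndarray) -> float:
    """Mean off-diagonal Pearson correlation between model outputs.

    preds has shape (n_models, n_samples); each row holds one model's
    predicted score for the same samples.
    """
    corr = np.corrcoef(preds)  # rows are treated as variables
    n = corr.shape[0]
    off_diag = corr[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())


def variance_ratio(rho: float, n_models: int) -> float:
    """Variance of an equal-weight average relative to one model,
    assuming equal variances and a common pairwise correlation rho."""
    return rho + (1.0 - rho) / n_models


# Synthetic stand-in for six highly correlated models (NOT the paper's data):
# one shared signal plus a small amount of model-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=2000)
preds = np.stack([shared + 0.1 * rng.normal(size=2000) for _ in range(6)])

rho = mean_pairwise_correlation(preds)
print(f"mean pairwise correlation:      {rho:.3f}")
print(f"variance ratio at observed rho: {variance_ratio(rho, 6):.3f}")
print(f"variance ratio if uncorrelated: {variance_ratio(0.0, 6):.3f}")
```

Under these assumptions, six uncorrelated models would cut variance to one sixth of a single model's, while six nearly identical models leave it almost unchanged. That is the intuition behind a diversity ceiling.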
Modelwire context
Analyst take
The buried finding here is not the diversity ceiling itself but the calibration trap: ensembling TFMs degrades probability estimates even when accuracy holds, which matters far more in risk-sensitive applications than the 0.18% accuracy delta suggests.
This pairs directly with the same-day coverage of 'Distilling Tabular Foundation Models for Structured Health Data,' which showed a 26x inference speedup with 90% performance retention. Together, the two papers sketch a consistent picture: the path to production-viable TFMs runs through compression and simplification, not through stacking complexity on top of complexity. If ensembling adds 253x compute for negligible accuracy gain while distillation recovers most single-model performance at a fraction of the cost, the practical roadmap for regulated domains like healthcare becomes clearer. Gradient-boosted trees remain the relevant competitive baseline here, and neither paper has displaced that benchmark in a way that would shift procurement decisions today.
Watch whether any TFM vendor publishes ensemble results on OpenML-CC18 using architecturally diverse model families rather than same-family variants. If cross-architecture ensembles break the correlation ceiling observed here, the diversity argument revives; if they don't, the single-best-model heuristic becomes the durable practitioner default.
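The calibration trap noted in the analyst take, where probability estimates degrade while accuracy holds, can be made concrete with a small sketch. The summary does not say how calibration was measured or why it degrades, so everything here is an assumption: the metric is a binary expected calibration error (ECE), the data are synthetic, and the "degraded" scores are produced by an artificial shrinkage toward 0.5 chosen only because it leaves every predicted label unchanged.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-weighted |accuracy - confidence| over confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence in the predicted class always lies in [0.5, 1.0].
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


def accuracy(probs, labels):
    return float(((np.asarray(probs) >= 0.5).astype(int) == labels).mean())


# Synthetic, perfectly calibrated scores: labels are drawn with probability p.
rng = np.random.default_rng(0)
p = rng.random(20000)
labels = (rng.random(20000) < p).astype(int)

# Hypothetical degraded scores: same ranking and same side of 0.5,
# but pulled halfway toward 0.5 (under-confident).
p_shrunk = 0.5 + 0.5 * (p - 0.5)

acc_a, acc_b = accuracy(p, labels), accuracy(p_shrunk, labels)
ece_a = expected_calibration_error(p, labels)
ece_b = expected_calibration_error(p_shrunk, labels)
print(f"accuracy: calibrated={acc_a:.4f}  shrunk={acc_b:.4f}")
print(f"ECE:      calibrated={ece_a:.4f}  shrunk={ece_b:.4f}")
```

Both score sets yield identical predicted labels and therefore identical accuracy, yet the shrunk scores have a much larger ECE. An accuracy-only comparison of an ensemble against its best base model would miss this kind of regression, so a calibration metric is worth reporting alongside accuracy.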
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Tabular Foundation Models · OpenML · Gradient-boosted trees · Cascade stacking
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.