Research Tools & Code · arXiv cs.CL · 2d ago

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

EntSQL addresses a blind spot in text-to-SQL evaluation: enterprise deployments where proprietary business logic, internal metrics, and organizational conventions matter as much as schema design. Benchmarks such as Spider and BIRD test generalization across public databases, but they miss the grounding challenge that real-world SQL systems face when operating over private knowledge bases. The 1,066-example bilingual dataset, which spans five domains, signals growing recognition that LLM-to-database pipelines need domain-specific validation before production use, particularly in regulated or knowledge-heavy sectors where hallucinated business rules carry real cost.

Modelwire context

Explainer

EntSQL's actual novelty isn't the dataset size but the explicit focus on proprietary business logic and organizational conventions as evaluation targets. Prior benchmarks (Spider, BIRD) tested schema generalization; this one tests whether models can ground queries in domain-specific rules that don't live in the schema itself.

This connects directly to the broader pattern in recent benchmarking work: static evaluation is giving way to context-aware, domain-specific validation. ClinEnv (early June) forced models to operate under real clinical constraints and sequential decision-making; EntSQL applies the same principle to database queries, recognizing that production deployments fail not because models can't parse SQL syntax but because they misunderstand what 'revenue' or 'active user' means in a specific organization. Both papers reject the assumption that generalization across public examples predicts real-world performance.
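To make the 'active user' point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the rows, and the business rule (external accounts with a login in the 30 days before the reporting date) are all invented for illustration and are not taken from EntSQL. Both queries are valid SQL against the same schema; only one matches the hypothetical organization's definition.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration (not from EntSQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, is_internal INTEGER, last_login TEXT);
INSERT INTO users VALUES
  (1, 0, '2024-06-28'),
  (2, 0, '2024-03-01'),
  (3, 1, '2024-06-29'),
  (4, 0, '2024-06-15');
""")

# Schema-only reading: "active" means the user has ever logged in.
NAIVE_SQL = "SELECT COUNT(*) FROM users WHERE last_login IS NOT NULL"

# Assumed internal rule: external accounts with a login in the
# 30 days before the reporting date. This rule is not in the schema.
GROUNDED_SQL = """
SELECT COUNT(*) FROM users
WHERE is_internal = 0
  AND last_login >= date(:as_of, '-30 days')
"""

def count(sql, params=()):
    """Run a COUNT(*) query and return the single integer result."""
    return conn.execute(sql, params).fetchone()[0]

naive_count = count(NAIVE_SQL)
grounded_count = count(GROUNDED_SQL, {"as_of": "2024-06-30"})
print(naive_count, grounded_count)
```

On this toy data the two readings disagree (4 versus 2), even though nothing in the schema signals which one the organization intends.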

If models show a consistent performance gap between EntSQL's English and Chinese examples, that would suggest the grounding problem is partly linguistic, not just domain-specific. If performance on the five domains clusters by industry type rather than schema complexity, that would support the view that business logic is the actual bottleneck, not SQL generation.
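The two checks above amount to grouping per-example results by language and by domain. A minimal sketch follows, assuming a list of result records with hypothetical field names (`lang`, `domain`, `correct`) and made-up values; EntSQL's actual data format and domain names are not described in the summary.

```python
from collections import defaultdict

# Hypothetical per-example results; field names and values are invented.
results = [
    {"lang": "en", "domain": "domain_a", "correct": True},
    {"lang": "en", "domain": "domain_a", "correct": True},
    {"lang": "en", "domain": "domain_b", "correct": False},
    {"lang": "en", "domain": "domain_b", "correct": True},
    {"lang": "zh", "domain": "domain_a", "correct": True},
    {"lang": "zh", "domain": "domain_a", "correct": False},
    {"lang": "zh", "domain": "domain_b", "correct": False},
    {"lang": "zh", "domain": "domain_b", "correct": False},
]

def accuracy_by(records, key):
    """Return {group value: fraction of correct examples} for the given key."""
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for record in records:
        totals[record[key]][0] += int(record["correct"])
        totals[record[key]][1] += 1
    return {group: c / n for group, (c, n) in totals.items()}

by_lang = accuracy_by(results, "lang")
by_domain = accuracy_by(results, "domain")
lang_gap = abs(by_lang["en"] - by_lang["zh"])
print(by_lang, by_domain, lang_gap)
```

A real analysis would also need a per-example measure of schema complexity to separate it from domain effects; that is left out here because the summary does not say how the benchmark records it.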

Coverage we drew on

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: EntSQL · Spider · BIRD · Spider 2.0

Read the full story at arXiv cs.CL (arxiv.org).
