
Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets
A new method addresses a gap in how LLMs handle numeric tabular data, which dominates scientific workflows but has no native representation in foundation models. The approach combines exploratory data analysis (EDA) descriptors with sentence transformers and Canonical Correlation Analysis (CCA) to enable cross-dataset similarity and alignment without requiring shared variable definitions. The work matters because LLMs are strong on text, while scientific practice needs reasoning over heterogeneous numeric datasets at scale. Bridging the two opens pathways to more interpretable dataset discovery and to transfer learning across scientific domains.
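
To make the first stage more concrete, here is a minimal sketch of how EDA descriptors for a numeric column might be computed and rendered as text that a sentence transformer could then embed. The function names (`describe_column`, `descriptor_text`), the choice of descriptors, and the wording of the text template are all hypothetical and are not taken from the work itself. The sentence-transformer and CCA stages are not shown.

```python
import statistics


def describe_column(name, values):
    """Return a few common EDA descriptors for one numeric column.

    The descriptor set here is an illustrative assumption, not the
    set used in the summarized work.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Population skewness; defined as 0.0 for a constant column.
    skew = 0.0
    if std > 0:
        skew = sum(((v - mean) / std) ** 3 for v in values) / len(values)
    return {
        "name": name,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "median": statistics.median(values),
        "skew": skew,
    }


def descriptor_text(d):
    """Render descriptors as one sentence for a text encoder.

    The column name is left out on purpose (an assumption of this
    sketch) so that the text depends only on the column's statistics.
    """
    return (
        f"column with {d['count']} values, "
        f"range {d['min']:.2f} to {d['max']:.2f}, "
        f"mean {d['mean']:.2f}, std {d['std']:.2f}, "
        f"median {d['median']:.2f}, skew {d['skew']:.2f}"
    )


if __name__ == "__main__":
    d = describe_column("temperature", [1.0, 2.0, 3.0, 4.0, 5.0])
    print(descriptor_text(d))
```

Rendering statistics as text is what would let a text-only encoder compare columns from datasets with different variable names. Whether the original method omits names in this way is not stated in the summary.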
























