
Molecular Representations for Large Language Models
Researchers have addressed a critical gap in LLM chemistry workflows by introducing MolJSON, a purpose-built molecular representation format, and benchmarking it against five incumbent standards across multiple frontier models. The work matters because chemistry-focused LLM systems depend on reliable molecular encoding, yet the field has defaulted to SMILES and IUPAC names without rigorous comparative validation. The evaluation spans GPT-5 variants and Claude, covers more than 78K test cases, and measures which representations yield the highest reasoning accuracy on translation and structure tasks. Its results directly inform how labs architect chemistry agents and whether domain-specific tokenization strategies outperform generic text formats.










