Tools & Code · Products & Apps
datasette-agent 0.1a4
Datasette-agent, an AI chat interface for querying databases, now integrates directly into Datasette's navigation layer via a new JavaScript plugin hook. The 0.1a4 release leverages Datasette 1.0a30's makeJumpSections() API to surface agent chat as a keyboard-accessible command (slash menu), embedding agentic AI workflows into developer tooling rather than requiring separate interfaces. This reflects a broader shift toward embedding LLM agents into existing infrastructure and developer workflows, reducing friction for data exploration tasks.
Simon Willison · May 24 · 67

Opinion & Analysis · Research
Quoting Armin Ronacher
Armin Ronacher, maintainer of Pocoo projects, identifies a critical failure mode in open-source issue reporting: LLM-generated submissions that obscure rather than clarify problems. These AI-reworded reports trade accuracy for false confidence, producing speculative root causes, unreproducible test cases, and misaligned code analogies. The pattern signals a growing friction point where LLM intermediation degrades signal quality in collaborative software development, forcing maintainers to spend cycles filtering noise rather than solving genuine bugs.
Simon Willison · May 24 · 77

Models & Releases · Tools & Code
⚡️ Google's Open AI Strategy, Omar Sanseviero, Google DeepMind
Google DeepMind's Gemma 4 introduces a parameter-offloading architecture that decouples effective from active parameters, allowing models to run on-device with only a fraction loaded into GPU memory at inference time. This efficiency breakthrough targets mobile and edge deployment, directly competing with Apple's on-device inference strategy and reshaping expectations around model size versus practical deployment cost. The shift signals a strategic pivot in open-source model design away from raw scale toward architectural efficiency, with implications for the entire on-device AI ecosystem.
Latent Space · May 24 · 80

Tools & Code · Business & Funding
⚡️ Why you should build Science Fiction, Sunil Pai, Cloudflare
Cloudflare is positioning Durable Objects and Dynamic Workers as a runtime foundation for AI agent infrastructure, directly competing with managed platforms like Anthropic's cloud agents. The conversation surfaces a critical gap in the agent-building landscape: the absence of a standardized, cross-platform architecture pattern (analogous to React's role in frontend development). This matters because fragmentation across agent frameworks raises switching costs and slows adoption. Insiders should track whether Cloudflare's edge-compute approach gains traction as an alternative to centralized cloud-managed solutions, particularly for latency-sensitive or cost-conscious deployments.
Latent Space · May 24 · 68

Products & Apps · Opinion & Analysis
I tried Amazon’s Bee wearable and am both intrigued and slightly creeped out
Amazon's Bee wearable represents the latest push by a major cloud provider into always-on AI hardware, surfacing a recurring tension in the consumer AI stack: utility versus surveillance risk. The device joins a growing category of ambient intelligence products that offload inference to edge or cloud, raising questions about data collection practices and user consent that regulators and privacy advocates are beginning to scrutinize. For AI infrastructure investors and product teams, Bee signals how quickly wearables are becoming a distribution channel for LLM-backed features, even as the privacy model remains unsettled.
TechCrunch - AI · May 24 · 65

Research · Models & Releases
ByteDance study finds that asking LMMs questions beats making it transcribe text for long document training
ByteDance's Seed model demonstrates that training multimodal systems via question-answering on long documents outperforms transcription-based approaches, enabling a 7B parameter model to match or exceed larger competitors on documents four times longer than its training distribution. This finding reshapes how practitioners should architect document understanding pipelines, shifting focus from OCR-like extraction toward retrieval-augmented reasoning as a core training objective rather than a post-hoc augmentation.
The Decoder · May 24 · 73

Opinion & Analysis · Research
Deepmind's Hassabis sees humanity "in the foothills of the singularity" while LeCun says current AI isn't intelligent
Three senior AI researchers have staked out divergent positions on whether current systems constitute genuine intelligence or approach AGI. Hassabis frames the field as entering a critical inflection point toward singularity, while LeCun argues today's models lack true reasoning capacity. Vinyals offers a calibration: systems now exceed what would have seemed like AGI in 2019, yet remain fundamentally limited in learning and discovery. This disagreement among DeepMind and Meta leadership signals unresolved questions about capability measurement and timeline expectations that will shape investment, regulation, and research priorities across the industry.
The Decoder · May 24 · 73

Research · Policy & Regulation
Hackers are learning to exploit chatbot ‘personalities’
Security researchers are uncovering a new attack surface in conversational AI systems: exploiting the behavioral quirks and designed personalities of chatbots to bypass safety guardrails. Unlike early jailbreaks that relied on crude prompt injection, adversaries now target the tension between a model's helpfulness objective and its safety constraints, using personality traits as leverage points. This shift signals that as chatbot defenses mature, attackers are moving upstream to exploit the fundamental design trade-offs baked into instruction-tuning and RLHF processes. For AI teams, this underscores the fragility of behavioral alignment and the need for adversarial testing that goes beyond static prompt lists.
The Verge - AI · May 24 · 69

Products & Apps
These Robots Are Making Meals for a Nonprofit in San Francisco’s Tenderloin
A San Francisco nonprofit has deployed robotic meal preparation systems to address chronic volunteer shortages in the Tenderloin, one of the city's most economically distressed neighborhoods. The deployment signals a pragmatic shift in how nonprofits are adopting automation to sustain social services when human labor proves unavailable or unsustainable. This case study illustrates a broader pattern: AI and robotics are moving beyond corporate efficiency gains into mission-driven sectors where labor scarcity creates genuine operational friction. The outcome will likely influence how other nonprofits evaluate automation ROI in resource-constrained environments.
WIRED - AI · May 24 · 58

Research · Tools & Code
Large Language Model Selection with Limited Annotations
Researchers have introduced SELECT-LLM, an active learning framework that dramatically reduces annotation costs when benchmarking multiple candidate models against each other. Rather than labeling fixed evaluation sets, the system identifies which queries would most efficiently distinguish between competing LLMs by measuring expected information gain from model output similarities. This approach sidesteps architectural assumptions and weight access, making it applicable across proprietary and open-weight systems alike. For practitioners evaluating dozens of models for production deployment, this addresses a genuine friction point: model selection at scale has been prohibitively expensive. The technique shifts evaluation from exhaustive annotation to strategic sampling, potentially reshaping how teams conduct model triage.
arXiv cs.CL · May 24 · 58

Products & Apps · Research
Why you shouldn't leave model selection on default in Copilot, Gemini and other AI tools
Default model selection in mainstream AI assistants masks a critical reliability gap: identical inputs produce wildly different outputs depending on which underlying model processes them. Mathematician Adam Kucharski's experiment with Copilot revealed the tool fabricates country-specific stereotypes when fed unlabeled data, a failure that advanced reasoning models catch but only when users explicitly select them. This exposes a usability and trust problem at scale. As AI tools embed deeper into workflows, burying model choice behind defaults risks systematizing hallucination and bias without user awareness or recourse.
The Decoder · May 24 · 73

Research · Models & Releases
Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models
Researchers demonstrate that sparse autoencoders can steer medical vision-language models at inference time to reduce hallucinations in radiology report generation without retraining. By applying targeted suppression and amplification of learned features across late-layer SAEs, the technique achieves 5-17% improvements in clinical accuracy across three VLM architectures on MIMIC-CXR benchmarks. This work signals a broader shift toward post-hoc steering as a practical alternative to fine-tuning for domain-critical applications, with implications for how practitioners can adapt pretrained models to high-stakes medical settings without computational overhead.
arXiv cs.CL · May 24 · 62

Research · Tools & Code
MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing
Document parsing has hit a structural ceiling: VLM-based OCR excels at single-page extraction but fractures multi-page coherence, breaking tables and paragraphs split across boundaries. MinerU-Popo reframes this as a post-processing problem, reconstructing document-level logic from existing OCR outputs rather than retraining models. This matters for RAG pipelines and enterprise search, where fragmented documents degrade retrieval quality. The approach signals a pragmatic shift in the parsing stack: rather than chase end-to-end VLM improvements, teams are layering intelligent reconstruction on top of commodity OCR, lowering the barrier for production document systems.
arXiv cs.CL · May 24 · 58

Research
Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization
Researchers have unified two previously separate evaluation frameworks for assessing whether language model reasoning traces genuinely reflect underlying model behavior. The work introduces FaithMate, a preference-alignment tool that lets teams optimize models toward either input-perturbation faithfulness or parametric intervention faithfulness, then measures how gains transfer across paradigms. Testing across multiple models and datasets reveals positive correlation between the two approaches, suggesting that improving one form of faithfulness may strengthen the other. This matters for practitioners building interpretable systems, as it clarifies which optimization targets yield more robust explanations of model decisions.
arXiv cs.CL · May 24 · 58

Research
SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack
Researchers have developed SEP-Attack, a method that improves adversarial robustness testing for language models by using ensemble weighting via Determinantal Point Processes to better estimate which surrogate models transfer attacks most effectively. This addresses a critical gap in transfer-based attack research, where prior work treated all submodels equally or used unreliable importance scoring. The technique matters because understanding transferability of adversarial examples across models is essential for building defenses and evaluating real-world vulnerability of deployed systems that attackers cannot directly probe.
arXiv cs.CL · May 24 · 52

Research
NITP: Next Implicit Token Prediction for LLM Pre-training
Researchers propose Next Implicit Token Prediction, a training method that supplements standard next-token prediction with dense supervision in the model's representation space rather than just discrete output labels. By anchoring hidden states to shallow-layer embeddings as self-supervised targets, NITP aims to prevent representation collapse and anisotropy that can degrade generalization. The technique addresses a fundamental constraint in current LLM pre-training: one-hot supervision leaves latent geometry under-specified. If validated at scale, this could reshape how foundation models are initialized and regularized, particularly for efficiency-focused training regimes where representation quality directly impacts downstream performance.
arXiv cs.CL · May 24 · 62

Policy & Regulation · Business & Funding
Anthropic may keep supplying Claude to the NSA despite being flagged as a supply chain risk by the Pentagon
Anthropic is positioned to maintain its NSA contract despite Pentagon designation as a supply chain risk, a tension rooted in hardware constraints rather than capability gaps. Intelligence agencies face acute shortages of Nvidia's latest Grace Blackwell processors, making Anthropic's Mythos model, which operates on older silicon, strategically valuable despite security concerns. The removal of the contentious 'any lawful use' clause signals negotiated compromise, but the deal underscores how geopolitical AI competition and domestic chip scarcity are reshaping government procurement logic independent of traditional risk frameworks.
The Decoder · May 24 · 73

Research · Tools & Code
H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer
H$^{2}$MT addresses a fundamental bottleneck in transformer inference: the cost of processing irrelevant context in long-input scenarios. By pre-computing a semantic hierarchy and routing queries through it at inference time, the approach reduces wasted computation on unrelated text while avoiding the external storage and indexing overhead that plagues retrieval-augmented generation systems. This matters because it directly tackles prefill latency and memory consumption, two metrics that constrain practical deployment of long-context LLMs. The coarse-to-fine pruning strategy represents a structural shift from flat token processing, potentially reshaping how production systems balance context window size against inference speed.
arXiv cs.CL · May 24 · 62

Research · Tools & Code
Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn't have designed
A multi-institutional research team deployed an AI coding agent to autonomously search for novel scaling algorithms, yielding a control method that reduces compute requirements by 70 percent relative to standard self-consistency approaches while preserving accuracy. The discovery cost $40 and completed in under three hours, signaling a shift toward machine-driven algorithm design as a path to efficiency gains. This outcome matters because it demonstrates that AI systems can uncover optimization strategies outside human intuition, potentially reshaping how teams approach inference-time scaling and resource allocation in production systems.
The Decoder · May 24 · 85

Research
MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing
Hallucination detection remains a critical blocker for LLM deployment, especially in non-English and low-resource settings where existing confidence-based methods break down. MultiHaluDet tackles this by probing frozen LLM hidden states across all layers without language-specific retraining, using multi-scale attention to surface deep factual inconsistencies. The approach matters because it sidesteps the brittleness of single-layer introspection and avoids the cost of per-language fine-tuning, potentially making hallucination filtering practical at scale across diverse linguistic contexts.
arXiv cs.CL · May 24 · 58

Research · Tools & Code
Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
PsyDefDetect, a shared task at BioNLP 2026, benchmarks AI systems on classifying psychological defense mechanisms in emotional support conversations using a clinically grounded framework. The initiative released PsyDefConv, a 200-dialogue corpus annotated under the Defense Mechanism Rating Scales standard, attracting 172 participants and 563 submissions. This work signals growing investment in clinical NLP and dialogue understanding, pushing language models toward nuanced mental health applications where misclassification carries real stakes. The scale of participation and clinical grounding suggest the field is moving beyond generic conversation tasks toward domain-specific evaluation in high-stakes domains.
arXiv cs.CL · May 24 · 58

Research
Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation
A new study exposes a critical blind spot in how the AI industry validates multilingual LLMs: machine-translated benchmarks contain systematic errors that go largely undetected, yet measurably degrade model performance scores. By comparing LLM-based error detection against human expert annotations and quantifying how translation flaws (rather than source problems) drive accuracy drops, the research reveals that current multilingual evaluation metrics may be fundamentally unreliable. This matters because vendors and researchers routinely cite multilingual benchmarks to claim parity across languages, but those claims rest on corrupted data. The findings suggest the field needs either human-vetted translations or far more rigorous automated quality control before drawing conclusions about true cross-lingual capability.
arXiv cs.CL · May 24 · 62

Research · Models & Releases
When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation
A controlled evaluation of reasoning-enabled frontier LLMs reveals a counterintuitive finding: disabling chain-of-thought reasoning in GPT-5.4 produces superior clinical documentation compared to reasoning-augmented variants across three healthcare benchmarks. The study challenges the assumption that reasoning capabilities automatically improve structured, domain-specific outputs, suggesting that for clinical SOAP note generation, simpler decoding paths may outperform complex inference chains. This has implications for how enterprises deploy reasoning models in regulated settings where output quality and consistency matter more than benchmark performance.
arXiv cs.CL · May 24 · 62

Research · Models & Releases
DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting
Researchers propose a differentiable training objective that sidesteps the precision-versus-efficiency tradeoff plaguing counterfactual story rewriting. LLMs struggle with this task because edits must be surgical, yet standard maximum-likelihood training lacks the granularity to enforce localized changes without reinforcement learning's computational overhead. This work bridges that gap with a differentiable alternative, potentially unlocking faster iteration on fine-grained text generation tasks where conventional objectives fail to capture the nuance required.
arXiv cs.CL · May 24 · 54

Research · Models & Releases
Towards a Universal Causal Reasoner
Researchers have built UniCo, a synthetic data framework that systematically generates causal reasoning tasks across Pearl's Causal Ladder, translating symbolic examples into code and natural language to reflect real-world scenarios where causality isn't explicitly labeled. The work addresses a critical gap in LLM training: while benchmarks for causal reasoning exist, few datasets enable models to learn generalizable causal inference at scale. By filtering for reasoning shortcuts and grounding answers in formal causal inference, the team produced 66.6K high-quality instances that improved performance on smaller models like Qwen3-4B and Olmo-3-7B-Instruct. This signals growing momentum in making causal reasoning a trainable, composable capability rather than an emergent property, with implications for reliability in domains where causal claims matter.
arXiv cs.CL · May 24 · 62

Research · Models & Releases
Lngram: N-gram Conditional Memory in Latent Space
Researchers introduce Lngram, a memory architecture that decouples retrieval from transformer computation by learning discrete symbols in latent space rather than relying on tokenizer IDs. The approach addresses a fundamental tension in sequence modeling: balancing compositional reasoning with efficient knowledge lookup. By performing N-gram operations over learned symbols instead of text tokens, Lngram gains modality independence and shows consistent perplexity improvements in long-context settings. The technique also enables post-hoc injection of domain knowledge into existing pretrained models, suggesting a practical pathway for augmenting deployed systems without full retraining.
arXiv cs.CL · May 24 · 58

Research
Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning
Researchers propose KCoT, a framework that unifies chain-of-thought reasoning with graph representation learning by establishing a formal mathematical link between Transformer blocks and k-means clustering. The work addresses a real limitation in existing graph-based LLM reasoning: current methods treat graph structure and semantic reasoning as separate concerns, reducing interpretability and step-by-step coherence. By reframing iterative reasoning as clustering operations, this approach could improve how language models reason over structured data, with implications for knowledge graphs, recommendation systems, and any domain requiring both semantic and topological understanding.
arXiv cs.CL · May 24 · 58

Research
Repeated Sequences Reveal Gaps between Large Language Models and Natural Language
Researchers have identified a measurable gap between how LLMs and humans organize repeated linguistic patterns across different scales. Using entropy analysis of subsequence distributions, the work reveals that while power-law models fit some ranges of text structure, GPT-generated outputs diverge from human statistical organization in ways existing benchmarks miss. This matters because it exposes a blind spot in current evaluation: models may pass task-based tests while still failing to capture the deep compositional logic of natural language, suggesting that fluency metrics alone obscure fundamental structural deficits in how LLMs learn and reproduce linguistic hierarchy.
arXiv cs.CL · May 24 · 58

Research · Models & Releases
Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning
Geo-Expert demonstrates that domain-specific fine-tuning can compress geological reasoning into smaller models, with an 8B parameter variant outperforming 70B generalists on subsurface and temporal reasoning tasks. The work uses parameter-efficient LoRA adaptation on a custom instruction dataset and introduces Geo-Eval, a specialized benchmark for Earth science reasoning. This signals a broader shift in LLM deployment: vertical specialization via targeted fine-tuning may be more cost-effective than scaling generalist models, particularly for knowledge-intensive domains where hallucination poses real operational risk.
arXiv cs.CL · May 24 · 58

Research · Policy & Regulation
Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data
A new paper traces how translator labor has become foundational infrastructure for modern AI systems, from statistical machine translation through multilingual LLMs. Translation memories and parallel corpora represent supervised training data of extraordinary value, yet translators have historically been compensated as contract deliverable providers rather than recognized as data contributors. The work examines how copyright frameworks have obscured translators' role in building the linguistic foundations that enabled the Transformer era, raising questions about data provenance, labor attribution, and the political economy of AI training at scale.
arXiv cs.CL · May 24 · 62