Models & Releases · Research
Behind the Scenes Hardening Firefox with Claude Mythos Preview
Mozilla's early access to Claude Mythos enabled systematic vulnerability discovery across Firefox's codebase, flipping the script on AI-assisted security audits. Where LLM-generated bug reports were previously dismissed as low-signal noise, Anthropic's latest model demonstrated sufficient precision to surface hundreds of genuine exploitable flaws. This marks an inflection point for AI-assisted security work: maintainers now face pressure to treat machine-generated findings seriously, while the economics of vulnerability disclosure shift toward automated detection at scale. The episode signals that frontier LLMs are crossing into domains where false positives carry real cost, forcing open-source governance to adapt.
Simon Willison · May 7 · 89
Research · Models & Releases
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Researchers propose Positive-Only Policy Optimization (POPO), a refinement to reinforcement learning methods for LLM reasoning that sidesteps a core limitation in Group Relative Policy Optimization (GRPO). The key insight: penalizing sparse negative samples under binary reward signals fails to capture failure gradation, whereas learning exclusively from positive rollouts with implicit negative gradients may yield stronger signal efficiency. This addresses a real bottleneck in the RLVR pipeline as the field races to scale reasoning capabilities beyond current PPO and GRPO baselines.
arXiv cs.CL · May 7 · 62
Research · Tools & Code
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
Researchers propose SIRA, a retrieval-augmented agent that collapses multi-turn exploratory search into single, corpus-aware queries by learning domain-specific retrieval priors. This addresses a fundamental inefficiency in how LLM-based systems interact with knowledge bases: current agents waste rounds reformulating queries like novices rather than leveraging structural knowledge like experts. The work matters because retrieval latency and recall directly impact production RAG systems at scale, and a compression mechanism could reshape how enterprises deploy agents over proprietary data.
arXiv cs.LG · May 7 · 58
Research
Inductive Venn-Abers and related regressors
Researchers have extended Venn-Abers predictors, a class of probabilistic classifiers known for statistical validity guarantees, from binary classification into unbounded regression by incorporating conformal prediction techniques. This generalization addresses a longstanding limitation in the field: prior work only handled binary or bounded regression cases. Empirical results suggest the derived point regressors modestly outperform standard baselines on larger datasets, making the approach potentially valuable for practitioners building calibrated prediction systems where uncertainty quantification and formal validity bounds matter alongside raw accuracy.
arXiv cs.LG · May 7 · 52
Research · Models & Releases
Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction
Researchers have developed a graph-neural approach to predict fluorescent protein brightness by modeling how local chemical environments around chromophores influence quantum yield, moving beyond sequence-based protein language models. The method converts 3D protein structures into typed residue graphs partitioned by chromophore subregion, then applies channel-specific signal propagation to extract 52 interpretable physical features for band-specific regression. This work exemplifies how domain-specific geometric inductive biases and mechanistic decomposition can outperform end-to-end learned representations in molecular property prediction, a pattern increasingly relevant as ML practitioners optimize for interpretability and sample efficiency in structural biology.
arXiv cs.LG · May 7 · 54
Research · Tools & Code
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Multimodal domain generalization research lacks standardized evaluation, making it unclear whether performance improvements reflect genuine algorithmic breakthroughs or experimental inconsistencies. MMDG-Bench addresses this fragmentation by establishing the first unified benchmark across datasets, modality configurations, and real-world failure modes including corruptions and missing inputs. This standardization effort matters because it directly impacts how practitioners assess robustness claims in production systems and signals a field maturation moment where reproducibility and comparability become prerequisites for credible progress.
arXiv cs.LG · May 7 · 58
Research · Models & Releases
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA addresses a fundamental bottleneck in agentic LLM training: long-horizon decision-making without reactive collapse. By sampling high-level strategies upfront and conditioning action sequences on them, the framework decouples exploration from credit assignment, enabling hierarchical RL at scale. The approach combines GRPO-style rollouts with strategy diversity and self-critique, tested across interactive environments like ALFWorld and WebShop. This matters because most deployed LLM agents still struggle with multi-step reasoning and exploration trade-offs. Insiders should track whether this hierarchical abstraction pattern becomes standard in production agentic systems.
arXiv cs.CL · May 7 · 62
Research
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Researchers have unified two separate interpretability threads by proposing concept-based abductive and contrastive explanations for vision models. Rather than explaining predictions through either high-level concepts alone or low-level pixel features, this work identifies minimal sets of human-understandable concepts that causally drive model outputs. The advance matters because it bridges the gap between formal causal reasoning and practical explainability, enabling practitioners to understand not just what a vision model sees but why it decides, with explicit causal grounding. This directly addresses a core pain point in model deployment: regulators and users increasingly demand explanations that go beyond black-box confidence scores.
arXiv cs.LG · May 7 · 62
Research · Models & Releases
Recursive Agent Optimization
Researchers propose Recursive Agent Optimization, a training method that enables AI agents to spawn and delegate subtasks to themselves recursively, effectively implementing divide-and-conquer at inference time. The approach addresses a fundamental scaling bottleneck: agents trained with RAO generalize to problems harder than their training distribution, handle contexts exceeding their native window, and achieve faster wall-clock inference through strategic task decomposition. This technique matters because it decouples model capability from context length and problem difficulty, potentially reshaping how practitioners approach scaling beyond simple parameter increases or longer context windows.
arXiv cs.CL · May 7 · 68
Research · Models & Releases
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Researchers have built ScaleLogic, a synthetic benchmark that isolates two independent variables in LLM reasoning: proof depth and logical expressiveness. By systematically varying task complexity across implication-only through first-order logic with quantifiers, the work reveals how RL training compute scales with reasoning difficulty. This addresses a long-standing gap in understanding whether current RL methods can push LLMs toward genuine long-horizon planning or merely memorize shallow patterns. The findings matter for anyone betting on RL as a path to more capable reasoning systems.
arXiv cs.CL · May 7 · 62
Research · Tools & Code
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Research agents built on LLMs routinely cite sources in synthesized reports, but those citations go largely unverified, creating a credibility gap between apparent rigor and actual accuracy. This paper introduces the first systematic framework for extracting and validating inline citations from model-generated markdown at scale, using AST parsing to retrieve actual source content and measure consistency between claims and their references. The work addresses a critical blind spot in production AI systems: while RAG improves factuality, it doesn't guarantee that cited sources are accessible, relevant, or actually support the claims attributed to them. For teams deploying research agents or evaluating LLM outputs, this framework offers a reproducible method to audit citation integrity and expose hallucinated or mismatched attributions.
arXiv cs.CL · May 7 · 62
Research
Crafting Reversible SFT Behaviors in Large Language Models
Researchers propose Loss-Constrained Dual Descent, a method to compress supervised fine-tuning behaviors into sparse, mechanistically necessary subnetworks that remain controllable at inference without weight modification. This addresses a critical gap in LLM interpretability: existing circuit attribution methods identify correlations post-hoc but cannot guarantee causal necessity or enable selective behavior control. The work matters for practitioners seeking fine-grained control over model outputs and for safety teams needing to isolate and modify specific learned behaviors without full retraining, advancing the frontier of mechanistic understanding beyond correlation-based approaches.
arXiv cs.LG · May 7 · 62
Research · Hardware & Infra
Hybrid Quantum-Classical GANs for the Generation of Adversarial Network Flows
Researchers are combining quantum computing with classical machine learning to address fundamental GAN limitations in adversarial network traffic generation. By encoding latent vectors as quantum states rather than sampling classical noise, the hybrid QC-GAN framework claims to achieve more expressive representations while reducing computational overhead, potentially lowering barriers to training on high-dimensional security datasets. This work sits at the intersection of quantum machine learning maturation and adversarial ML, signaling that quantum advantage may first emerge in specialized domains like synthetic data generation before broader deployment.
arXiv cs.LG · May 7 · 52
Research · Tools & Code
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
LiVeAction addresses a critical bottleneck in edge AI: compressing high-dimensional sensor data without sacrificing machine-perception accuracy. Unlike human-centric codecs (JPEG, MPEG), this neural compression scheme targets wearable and remote devices constrained by bandwidth and power, handling non-standard modalities like hyperspectral imagery and spatial audio. The work signals growing recognition that general-purpose compression wastes signal structure; specialized tokenizers that exploit domain-specific redundancy unlock better rate-distortion trade-offs for downstream ML tasks. This matters for robotics, medical imaging pipelines, and IoT deployments where inference happens on-device.
arXiv cs.LG · May 7 · 58
Research · Tools & Code
PianoCoRe: Combined and Refined Piano MIDI Dataset
PianoCoRe unifies fragmented symbolic music datasets into a 250k-performance corpus spanning 5,625 classical pieces, addressing a critical bottleneck in music information retrieval and generative audio research. The tiered release strategy, from raw pre-training data to fine-grained note-aligned subsets, enables both large-scale model training and expressive performance modeling. This infrastructure move matters because symbolic music remains underexplored in foundation model development compared to text and images, and standardized, aligned datasets are prerequisites for advancing music understanding and generation systems at scale.
arXiv cs.LG · May 7 · 58
Research · Tools & Code
Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation
Researchers demonstrate that parser agreement can reliably signal annotation quality in morphosyntactic parsing for L2 Korean, enabling a scalable human-in-the-loop workflow that reduces manual labeling burden. The work reveals systematic failure modes in parser disagreement, clustering around grammatical relations and clause boundaries, which points toward both immediate model refinement opportunities and deeper representational gaps in handling non-native language syntax. This bridges practical annotation efficiency with interpretability insights relevant to building more robust multilingual NLP systems.
arXiv cs.CL · May 7 · 52
Policy & Regulation
Europe's answer to AI regulation complexity is to just delay most of it
The EU's revised AI rulebook trades enforcement rigor for implementation speed, deferring high-risk AI compliance to 2027-2028 while exempting SMEs from stricter obligations. The move signals regulatory pragmatism over precaution, though immediate wins include explicit bans on non-consensual synthetic media and August 2026 labeling mandates for deepfakes and generated text. For builders, this creates a two-tier compliance landscape where larger players face delayed but eventually tighter scrutiny, while smaller competitors gain breathing room. The strategy reflects Brussels acknowledging that overly aggressive timelines risked fragmenting the European AI market.
The Decoder · May 7 · 73
Research · Tools & Code
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
Multi-agent LLM systems face a fundamental coordination problem: individual agent prompts optimized in isolation often fail to serve the broader system goal. MASPO tackles this by introducing a joint evaluation framework that scores prompts not on local validity alone, but on their capacity to enable downstream agent success. This addresses a critical gap in agentic AI deployment, where prompt engineering has remained largely manual and siloed. For teams building production multi-agent workflows, this represents a shift toward systematic, automated prompt alignment across agent hierarchies, potentially reducing the trial-and-error cycles that currently plague complex orchestration tasks.
arXiv cs.CL · May 7 · 62
Research
Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance
Researchers formalize the adversarial dynamics between language models and evasion tactics, introducing Majority Understandable Modulation (MUM) to quantify where Algospeak breaks down. The work maps a critical tension in content moderation: as users obfuscate text to evade detection, readability collapses for ordinary audiences, not just filters. This framework matters because it exposes a structural limit to linguistic arms races, suggesting that perfect evasion and human comprehension cannot coexist. For platform builders and safety teams, the finding implies moderation pressure may self-correct through degraded user experience rather than technical intervention alone.
arXiv cs.CL · May 7 · 62
Research
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
Researchers have cracked a long-standing puzzle in optimization theory: why sign-based gradient methods like SignSGD and Muon outperform standard SGD in large model training despite lacking theoretical justification. The breakthrough reframes the problem using L1-norm stationarity and coordinate-wise noise models rather than the standard L2 smoothness assumptions, under which earlier analyses had shown that sign-based methods could not beat SGD. This work matters because it validates the algorithmic choices already embedded in production foundation model training pipelines, potentially unlocking further efficiency gains as practitioners now understand the mathematical conditions under which these cheaper, faster methods genuinely dominate.
arXiv cs.CL · May 7 · 62
Research · Tools & Code
SkillOS: Learning Skill Curation for Self-Evolving Agents
SkillOS addresses a critical limitation in deployed LLM agents: their inability to retain and build on past interactions. The system trains agents to autonomously curate reusable skills from experience using reinforcement learning, moving beyond manual skill engineering or fixed heuristics. This tackles a fundamental bottleneck in agent self-improvement, where learning effective long-term curation policies from sparse feedback has remained unsolved. For practitioners deploying agents at scale, this represents a path toward systems that genuinely evolve rather than reset, potentially reducing the operational overhead of continuous manual skill management.
arXiv cs.CL · May 7 · 62
Models & Releases · Products & Apps
We’re introducing three audio models in the API
OpenAI has released three production audio models that materially expand real-time voice capabilities for developers. GPT-Realtime-2 brings GPT-5-class reasoning to conversational AI, enabling more complex dialogue handling. GPT-Realtime-Translate covers 70+ input languages with live output in 13 languages, addressing a long-standing localization gap. GPT-Realtime-Whisper provides streaming transcription that keeps pace with natural speech. Together, these models signal OpenAI's shift toward multimodal, low-latency inference as a core platform offering, likely forcing competitors to accelerate similar voice stacks.
OpenAI (YouTube) · May 7 · 87
Research
Online Bayesian Calibration under Gradual and Abrupt System Changes
Researchers propose a framework for real-time Bayesian calibration that handles both gradual drift and sudden shifts in system behavior, addressing a critical gap in digital twin deployment. Classical calibration methods assume static environments and conflate model parameters with bias correction, limiting their use in production systems that evolve over time. This work extends data assimilation techniques with explicit bias modeling, enabling sequential updates under non-stationary conditions. The advance matters for practitioners building adaptive digital twins in manufacturing, climate modeling, and engineering where systems degrade or transition between operational regimes without retraining from scratch.
arXiv cs.LG · May 7 · 52
Research
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
Researchers have identified the mechanistic root of attention sink, a widespread pathology in LLMs where early tokens capture disproportionate attention weight. The work traces the problem to variance asymmetries in value aggregation during self-attention, then shows how sparse FFN down-projections amplify this effect by creating dimensional misalignment in first-token representations. This finding matters because attention sink degrades model efficiency and output quality, and understanding its structural origin opens paths to architectural fixes rather than post-hoc patches. The causal chain validation suggests interventions at the FFN level could reshape how transformers distribute representational load.
arXiv cs.LG · May 7 · 62
Research
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
Researchers propose SoftSAE, a dynamic variant of Sparse Autoencoders that adapts sparsity levels per input rather than enforcing uniform feature activation across all samples. This addresses a fundamental limitation in mechanistic interpretability: real-world data exhibits varying intrinsic dimensionality, yet fixed-K architectures waste capacity on simple inputs and starve complex ones. The work directly impacts SAE-based interpretability workflows for LLMs and vision models, suggesting that adaptive sparsity could improve both feature decomposition fidelity and computational efficiency in neural network analysis.
arXiv cs.LG · May 7 · 58
Research
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Researchers have constructed transformers that provably execute in-context logistic regression by implementing normalized gradient descent across layers, bridging the gap between transformer behavior and classical optimization algorithms. This work clarifies a fundamental mechanism underlying in-context learning: rather than operating as black boxes, attention-based models can be engineered to perform explicit algorithmic steps on context data. The finding matters because it grounds transformer capabilities in interpretable computation, potentially enabling better architectural design and offering a template for understanding how other algorithms might be embedded in neural networks.
arXiv cs.LG · May 7 · 62
Research
DARTS: Targeting Prognostic Covariates in Budget-Constrained Sequential Experiments
Researchers propose DARTS, a framework that reframes randomized controlled trials as a machine learning optimization problem. The core insight treats covariate measurement as a budget-constrained sequential decision, using Thompson sampling to identify which pretreatment features matter most for reducing treatment effect variance. This bridges causal inference and adaptive experimentation, with implications for how ML systems can be validated under real-world resource constraints. The decoupling result suggests practitioners can handle covariate selection separately from downstream analysis, potentially reshaping how expensive observational data is prioritized in production ML pipelines.
arXiv cs.LG · May 7 · 52
Research · Tools & Code
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
Researchers introduce DAPRO, a dynamic budget allocation framework that improves how AI labs evaluate LLM safety and behavior in multi-turn conversations. Current evaluation methods waste computational resources by spreading testing uniformly across interaction rounds, missing rare but critical events like jailbreaks that emerge unpredictably. DAPRO adapts budget allocation in real time, concentrating compute where signal is highest, making it feasible to construct statistically valid lower bounds on time-to-event under realistic constraints. This matters for safety teams: better evaluation efficiency means more thorough red-teaming and adversarial testing at lower cost, directly improving confidence in deployment decisions.
arXiv cs.LG · May 7 · 58
Research
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
Researchers have established the first rigorous mathematical foundation for understanding how weight decay shapes Transformer optimization landscapes, proving that L2-regularized cross-entropy loss satisfies Villani's coercive energy criteria. This functional-analytic characterization yields explicit constants governing convergence and generalization behavior, bridging a gap between empirical regularization practice in large language models and theoretical guarantees. The work matters for practitioners because it formalizes why weight decay stabilizes training and provides quantitative bounds on optimization dynamics that could inform better hyperparameter selection and architecture design for scaling.
arXiv cs.LG · May 7 · 58
Research
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
UniSD addresses a fundamental bottleneck in LLM adaptation: how to improve models through self-generated feedback without external supervision. The framework unifies previously scattered techniques for stabilizing self-distillation, combining multi-teacher consensus, exponential moving average regularization, and contrastive learning to handle the inherent noise in autoregressive trajectories. This matters because self-distillation sidesteps the cost and availability constraints of stronger teacher models, making model refinement more accessible. The systematic integration of complementary mechanisms signals a maturation in the field's understanding of when and why self-supervision works, with implications for both open-source and commercial model development workflows.
arXiv cs.CL · May 7 · 58