Opinion & Analysis
Quoting Andreas Påhlsson-Notini
Andreas Påhlsson-Notini argues that current AI agents inherit human flaws (indecision, impatience, constraint negotiation) rather than embodying a truly alien intelligence. The critique questions whether today's systems are genuinely autonomous or merely mimic human problem-solving patterns.
Simon Willison · Apr 21 · 64

Research
The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
Researchers tested whether prompt engineering or model selection better improves LLM accuracy on fan experience ratings from baseball survey text. Prompt tweaks yielded only 2 percentage points of gain (67% to 69% accuracy), while GPT-5.2 and GPT-4.1-mini both underperformed the baseline, suggesting diminishing returns on optimization.
arXiv cs.CL · Apr 21 · 42

Research · Models & Releases
Micro Language Models Enable Instant Responses
Researchers developed micro language models (8M–30M parameters) that generate the first few words of a response directly on edge devices such as smartwatches while a cloud model completes the sentence, eliminating multi-second latency gaps. The approach matches the performance of 70M–256M parameter models while enabling responsive on-device AI.
arXiv cs.CL · Apr 21 · 62

Research · Models & Releases · Safety
ALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Researchers benchmarked eleven multimodal LLMs from the Qwen, Gemma, and Gemini families on embodied safety planning in kitchen environments, finding that models recognize hazards well in Q&A but fail to mitigate risks when acting as autonomous agents.
arXiv cs.CL · Apr 21 · 58

Products & Apps
Ordering with the Starbucks ChatGPT app was a true coffee nightmare
A Verge reporter's attempt to order coffee through Starbucks' ChatGPT integration exposed usability failures in the AI-powered ordering system, highlighting real-world friction when LLMs handle task-specific workflows.
The Verge — AI · Apr 21 · 58

Research
SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets
Researchers challenge the standard attention-based approach to edge-cloud inference under bandwidth constraints, showing that semantic diversity of transmitted data matters more than individual importance scores. The work suggests spatially uniform selection can match performance of importance-weighted methods at moderate budgets.
arXiv cs.LG · Apr 21 · 52

Research · Tools & Code
The "Small World of Words" German Free-Association Norms
Researchers released SWOW-DE, a dataset of free-association norms for 5,877 German words, filling a gap in multilingual psycholinguistic resources. The norms predict lexical decision performance and enable cognitive science research on semantic structure across languages.
arXiv cs.CL · Apr 21 · 42

Products & Apps
AI Dungeon maker Latitude unveils Voyage, a platform for creating AI-powered RPGs
Latitude, maker of AI Dungeon, launched Voyage, an AI-native platform that lets players build custom RPGs. The tool lowers barriers to game creation by automating narrative and world-building tasks that typically require design expertise.
TechCrunch — AI · Apr 21 · 65

Models & Releases
OpenAI teases GPT-Image 2 with an AI-generated screenshot that looks completely real
OpenAI is releasing GPT-Image 2, a new image generation model that has circulated under a codename for weeks. Early outputs are visually indistinguishable from photographs, marking a significant leap in photorealism for synthetic imagery.
The Decoder · Apr 21 · 92

Research
Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Researchers benchmarked consistency across GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash when generating exercise prescriptions repeatedly. GPT-4.1 achieved the highest semantic stability (0.955) but produced entirely unique outputs each time, revealing a tension between reproducibility and diversity that matters for clinical AI deployment.
arXiv cs.CL · Apr 21 · 52

Business & Funding · Products & Apps
Neura Robotics, AWS Collaborate to Bring Physical AI to the Real World
Neura Robotics and AWS partnered to address data scarcity in robotics, with Amazon planning to deploy physical AI systems in its fulfillment centers. The collaboration signals enterprise momentum in embodied AI as cloud providers move beyond software into warehouse automation.
AI Business · Apr 21 · 61

Research · Tools & Code
RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
Researchers released RoLegalGEC, the first Romanian-language dataset for grammatical error detection and correction in legal documents. The work addresses a gap in domain-specific NLP training data by combining synthetic generation with structured grammar understanding, enabling better error-correction tools for legal professionals.
arXiv cs.LG · Apr 21 · 42

Research
An Efficient Black-Box Reduction from Online Learning to Multicalibration, and a New Route to $Φ$-Regret Minimization
Researchers prove that online multicalibration can be solved efficiently by combining any no-regret learner with an expected variational inequality solver, resolving an open problem from SODA '24 and establishing new connections between multicalibration and regret minimization.
arXiv cs.LG · Apr 21 · 58

Research
A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry
Researchers released A Bolu, the first structured corpus of Sardinian improvisational poetry with 2,835 stanzas, addressing a gap in NLP resources for minority languages and oral linguistic heritage preservation.
arXiv cs.CL · Apr 21 · 42

Research
Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
Researchers analyzed how LLMs have shifted peer review practices at top AI conferences, examining changes in review language, evaluation priorities, and recommendation patterns since these models emerged. The study quantifies whether LLMs are reshaping academic gatekeeping beyond surface-level writing style.
arXiv cs.CL · Apr 21 · 58

Research · Tools & Code
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
Researchers introduce TACO, a self-improving compression framework that automatically learns how to reduce redundant observations in terminal agent interactions, addressing the quadratic token-cost problem that limits long-horizon reasoning tasks.
arXiv cs.CL · Apr 21 · 58

Business & Funding · Hardware & Infra
Anthropic Seals $100B Infrastructure Deal With Amazon
Anthropic secured a $100 billion infrastructure commitment from Amazon, expanding the AI vendor's compute capacity for model development and deployment. The deal underscores intensifying competition among cloud providers to lock in generative AI workloads.
AI Business · Apr 21 · 83

Research
Lyapunov-Certified Direct Switching Theory for Q-Learning
Researchers derive finite-time convergence guarantees for constant-stepsize Q-learning by modeling it as a stochastic switching system, using joint spectral radius analysis to tighten error bounds beyond standard approaches and provide computable certificates.
arXiv cs.LG · Apr 21 · 52

Research
Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference
Researchers propose a diagnostic framework for ColBERT and other late-interaction retrieval models, using learned latent spaces to surface systematic failures in biomedical ranking tasks. The work addresses a gap in model interpretability: token-level scores explain individual rankings, but they do not reveal whether a model reliably understands clinical concepts across varied phrasings.
arXiv cs.CL · Apr 21 · 52

Research
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
Researchers propose attention-map-based metrics to detect hallucinations in speech LLMs at inference time without requiring gold-standard outputs. The method, tested on Qwen-2-Audio and Voxtral-3B, uses lightweight classifiers to identify pathological attention patterns specific to audio, outperforming uncertainty-based baselines.
arXiv cs.LG · Apr 21 · 52

Research · Models & Releases
Structure-guided molecular design with contrastive 3D protein-ligand learning
Researchers combined SE(3)-equivariant transformers with contrastive learning to encode 3D protein-ligand structures into shared embeddings, then integrated these into a multimodal chemical language model for structure-guided drug discovery. The approach achieves competitive zero-shot virtual screening while generating synthetically accessible molecules conditioned on pocket or ligand data.
arXiv cs.LG · Apr 21 · 58

Research
Separating Geometry from Probability in the Analysis of Generalization
Researchers challenge the foundational i.i.d. assumption in generalization theory, proposing sensitivity analysis of optimization solutions as an alternative framework that does not require unverifiable probabilistic assumptions about the data distribution.
arXiv cs.LG · Apr 21 · 52

Research · Models & Releases
Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics
Researchers in Saudi Arabia built attention-enhanced LSTM models to predict heat stress in construction workers using smartwatch data, achieving 95.4% accuracy and reducing false alarms. The work demonstrates how interpretable deep learning can translate wearable physiological signals into real-time safety alerts for high-risk outdoor labor.
arXiv cs.LG · Apr 21 · 52

Research
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
Researchers found that LLM agents in multi-agent frameworks exhibit actor-observer asymmetry, a cognitive bias in which agents blame external factors for their own failures when self-reflecting but attribute identical errors to internal causes when auditing peers. A new benchmark quantifies this phenomenon and its impact on agent reliability.
arXiv cs.CL · Apr 21 · 62

Research
Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment
Researchers propose a semantic decoupling approach to emotion-cause pair extraction in conversations, separating emotion and cause semantics into distinct representation spaces and framing the task as global alignment rather than independent classification. The method aims to capture many-to-many conversational causality more accurately than existing pairwise approaches.
arXiv cs.CL · Apr 21 · 42

Products & Apps · Policy & Regulation
YouTube expands its AI likeness detection technology to celebrities
YouTube is rolling out AI-powered deepfake detection to celebrities and their representatives, enabling them to identify and request removal of synthetic media impersonating them. The expansion targets a growing problem of AI-generated celebrity likenesses used without consent.
TechCrunch — AI · Apr 21 · 65

Research · Models & Releases
Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Researchers propose Stochastic Attention, an inference-time technique that adds calibrated uncertainty to transformer-based scientific models by randomizing attention weights via multinomial sampling. The method generates predictive ensembles without retraining and requires only a single hyperparameter, tuned post hoc. It was tested on weather and time-series forecasting models.
arXiv cs.LG · Apr 21 · 58

Research
Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments
A reproducibility audit finds TurboQuant fails to outperform RaBitQ in head-to-head quantization tests, contradicting prior claims and raising questions about reported benchmarks from the original TurboQuant paper.
arXiv cs.LG · Apr 21 · 52

Research
Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection
Researchers developed a pipeline using LLMs to generate and evaluate obfuscated XSS payloads, combining deterministic transformations with runtime browser validation to test whether machine learning detection systems can identify morphed attack variants that preserve malicious behavior.
arXiv cs.LG · Apr 21 · 52

Research
Accelerating Optimization and Machine Learning through Decentralization
Researchers demonstrate that decentralized machine learning can converge faster than centralized training, challenging the conventional view that distributed optimization is merely a privacy-preserving compromise. The finding suggests practitioners may gain both privacy and computational efficiency by distributing model training across edge devices rather than centralizing data.
arXiv cs.LG · Apr 21 · 58