Top Papers — Mar 8
The highest-scoring arXiv ML papers from Mar 8, ranked by LLM relevance.
Nghi D. Q. Bui · 2026-03-05
The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execut…
Elita Lobo +7 · 2026-03-05
Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take ac…
Jonathan D. Chang +25 · 2026-03-05
We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work …
Sunishchal Dev +4 · 2026-03-05
We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, m…
Sicheng Fan +6 · 2026-03-05
We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectori…
Md Farhan Ishmam +1 · 2026-03-05
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web …
Preetam Prabhu Srikar Dammu +3 · 2026-03-04
With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many wid…
Yixia Li +9 · 2025-12-21
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a pote…
Yurun Chen +10 · 2025-10-01
As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in…
Benjamin Feuer +2 · 2026-03-05
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loo…