ResearcharXiv Watch
New reasoning benchmark adds long-horizon planning and tool-verification tasks
Summary
Researchers argue multiple-choice tests no longer capture agentic systems, with new tasks closer to real workflows.
Region
Global
Heat Score
85
Category
Research
Language
en
