ResearcharXiv WatchJun 22, 06:00 PM

New reasoning benchmark adds long-horizon planning and tool-verification tasks

Summary

Researchers argue multiple-choice tests no longer capture agentic systems, with new tasks closer to real workflows.

Region

Global

Heat Score