The Forgotten Worker: Why AI Agents Fail at the Job Humans Do Best
Site Owner
Published on 2026-05-11
AI agents ace single-task benchmarks but collapse when asked to do what humans do every morning: juggle dozens of interdependent tasks. Microsoft Research's CORPGEN paper reveals why — and shows that the bottleneck isn't reasoning. It's memory architecture.

Test this yourself. Open your favorite AI assistant and ask it to juggle three things simultaneously: draft an email to a client, check your calendar for conflicts, and summarize the last ten messages in a thread. Then ask it to switch context between those tasks without losing track of where it left off on each one.
Chances are, it stumbles. Not because the model is weak; it's plenty capable. But the benchmarks that made it famous tested it on one task at a time. That's not a job. That's a trick question.
This is the dirty secret of the AI agent boom: we've built remarkably capable systems and then forgotten to ask them to do what ordinary workers do every morning before 9 AM.
The Single-Task Lie
The AI industry has a testing problem. Every major benchmark — BrowserArm, GAIA, WebArena — evaluates agents on discrete, isolated tasks. Find the phone number. Book the flight. Reply to the email. These are useful for measuring progress, but they describe almost no real job.
A knowledge worker's actual morning looks nothing like this. You're halfway through a report when a Slack message interrupts you. The report gets paused. The Slack gets answered. While waiting for a reply, you glance at your email and spot something urgent that needs a five-minute response before you forget. The report sits. Eventually you come back, re-read where you left off, and continue. This happens dozens of times a day.
None of these benchmarks captures this. As a result, agents are neither built nor tuned to handle it.
Microsoft Research published a paper in February 2026 that names this gap directly. Their framework, CORPGEN, simulates what they call Multi-Horizon Task Environments: workloads where an AI must manage dozens of interdependent tasks simultaneously, switching context, tracking dependencies, and reprioritizing as new information arrives. When they ran leading agents through these environments, performance collapsed. Not gradually. Sharply. Completion rates fell from 16.7% on a light load (12 concurrent tasks) to 8.7% at 46 tasks, roughly halving, and the drop showed up in every system they tested.
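To make "interdependent tasks, context switching, and reprioritizing" concrete, here is a minimal Python sketch of that kind of workload. It is a toy model, not CORPGEN's implementation: the Task and Workload classes, the priority field, and the progress_note field are all hypothetical names chosen for illustration. The point is only that each paused task has to carry its own saved state, and that what to work on next depends on both urgency and unfinished dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # All fields are illustrative assumptions, not taken from the paper.
    name: str
    priority: int                                  # lower number = more urgent
    depends_on: set = field(default_factory=set)   # names of prerequisite tasks
    progress_note: str = ""                        # where work stopped when paused
    done: bool = False


class Workload:
    """A toy multi-task workload with dependencies and reprioritization."""

    def __init__(self):
        self.tasks = {}

    def add(self, task):
        # New work can arrive at any time, like a Slack message mid-report.
        self.tasks[task.name] = task

    def ready(self):
        # Unfinished tasks whose prerequisites are all done, most urgent first.
        finished = {name for name, t in self.tasks.items() if t.done}
        runnable = [t for t in self.tasks.values()
                    if not t.done and t.depends_on <= finished]
        return sorted(runnable, key=lambda t: (t.priority, t.name))

    def next_task(self):
        runnable = self.ready()
        return runnable[0] if runnable else None


if __name__ == "__main__":
    day = Workload()
    day.add(Task("report", priority=2))
    day.add(Task("send_report", priority=1, depends_on={"report"}))

    # send_report is urgent but blocked, so the report comes first.
    print(day.next_task().name)

    # An interruption arrives: save where the report stopped, then switch.
    day.tasks["report"].progress_note = "section 3 drafted"
    day.add(Task("slack_reply", priority=1))
    print(day.next_task().name)

    # After the interruption, resume the report with its note intact.
    day.tasks["slack_reply"].done = True
    resumed = day.next_task()
    print(resumed.name, "|", resumed.progress_note)
```

Even in this toy version, the hard part is not any single task. It is keeping every paused task's note accurate while the ready list keeps changing, which is the kind of bookkeeping the morning described above demands of a human worker.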