MiniMax Agent's Desktop Update: IM + Computer Use — The Digital Avatar is Here
Site Owner
Published on 2026-04-18
The MiniMax Agent desktop update redefines the entry point for agents operating computers. IM serves as the unified instruction gateway, Computer Use handles GUI-based tasks, and Pocket makes it all accessible from any messaging platform.
Imagine you're on the subway, and you suddenly remember there might be an important file on your desktop. Previously, you'd have to wait until you got to the office to check.
Now, you send your computer a message on Feishu: "Can you check if there's a 2025 report PDF on my desktop? Find it and send it to me."
The agent retrieves the file and sends it back. Task done.
Here's a question that would've sounded ridiculous five years ago: what if your computer could receive instructions from you via WeChat—and actually execute them? And what if it could do so even when you're nowhere near your desk?
This is the core scenario of MiniMax Agent's desktop update. (Source: MiniMax Agent Update "This time we redesigned how agents operate computers")
Three Scenarios Where Agents Work for You
Remote file retrieval. You're on the subway, in a meeting, traveling for work. Send an IM to the agent, and that file on your desktop arrives. No need to open your computer, no remote desktop required.
Resume screening and Feishu document generation. "Read all resumes in the Resume folder on my desktop. Based on years of design experience, quantifiable achievements, and skills match, select the top 3 candidates for an AI Native Product Designer role and generate interview questions. Compile everything into a Feishu document."
Computer Use reads local files on your computer and understands each resume's content. The Feishu CLI writes the final candidate analysis and interview questions into a Feishu document. The agent sends you the document link when done.
Operating local apps and system settings. "Open System Settings, find Lock Screen, set the screen saver to never activate. Then open the Pocket client to run the daily scheduled task, and send me a screenshot when done."
These tasks—modifying system settings, operating local clients—have no exposed command-line interfaces. Previously, you'd have to do them manually. Now the agent handles everything, showing you each step in IM.
Why Computer Use Has Always Been Broken
Getting an agent to operate a computer is one thing; getting it to complete the tasks you assign stably, accurately, and safely is another. A massive amount of engineering work lies between the two.
The tool design problem.
The common approach to Computer Use exposes a single monolithic computer tool in which every operation goes through pixel coordinates. Switching windows, clicking buttons, and operating web pages are all done by the model counting pixels, so precision and reliability are hard to guarantee.
MiniMax splits desktop operations into four independent tool domains:
- Desktop Control: screenshots, mouse operations (including modifier key combos), keyboard input, scrolling, dragging
- Window Manager: window list queries, focusing, minimize/maximize/close/move/resize, app launching
- Browser Engine: DOM operations, CSS selector positioning, JavaScript execution, structured navigation
- Clipboard: system clipboard read/write
Different tasks have different optimal execution paths. Window management calls system APIs directly, so the model never has to take a screenshot and work out where the minimize button is. Browser elements are located via DOM selectors, which is far more precise than counting pixels. Together with three platform CLIs (lark-cli, wecom-cli, and mmx) plus Bash and filesystem tools, the four tool domains add up to more than 60 tools. The agent selects the most appropriate tool path based on task type.
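As a rough illustration of that selection step, the sketch below routes an action to a tool domain and falls back to pixel-level desktop control only when no structured path exists. The domain names follow the article; the action names, the `ROUTES` table, and `pick_domain` are hypothetical and are not MiniMax's actual interface.

```python
from enum import Enum


class ToolDomain(Enum):
    DESKTOP_CONTROL = "desktop_control"  # screenshots, mouse, keyboard
    WINDOW_MANAGER = "window_manager"    # system window APIs
    BROWSER_ENGINE = "browser_engine"    # DOM selectors, JavaScript
    CLIPBOARD = "clipboard"              # system clipboard read/write


# Hypothetical mapping from an action kind to the most direct tool domain.
ROUTES = {
    "minimize_window": ToolDomain.WINDOW_MANAGER,
    "focus_window": ToolDomain.WINDOW_MANAGER,
    "launch_app": ToolDomain.WINDOW_MANAGER,
    "click_web_element": ToolDomain.BROWSER_ENGINE,
    "run_javascript": ToolDomain.BROWSER_ENGINE,
    "read_clipboard": ToolDomain.CLIPBOARD,
    "write_clipboard": ToolDomain.CLIPBOARD,
}


def pick_domain(action: str) -> ToolDomain:
    """Prefer a structured domain; fall back to pixel-level desktop control."""
    return ROUTES.get(action, ToolDomain.DESKTOP_CONTROL)
```

With this table, `pick_domain("minimize_window")` resolves to the Window Manager, while an action with no structured route, such as toggling a switch inside a native settings pane, falls through to Desktop Control.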
The screen adaptation problem.
The first step in Computer Use is getting the model to "see" the screen. But user displays vary wildly: MacBook Retina panels, external 4K monitors, 1080p, 720p. Total pixel count can differ by close to an order of magnitude.
The same screen content therefore reaches the model at very different detail densities on different devices. High-resolution screenshots overload it with information; low-resolution ones are blurry and cause mis-clicks.
MiniMax solves this at two levels. Unified coordinate system: the model does not output pixel coordinates directly; it outputs a relative position between 0 and 1, which the system converts to real coordinates based on the current screen resolution. Adaptive screenshot processing: the screen is first captured at physical-pixel precision, then scaled dynamically to fit the model's visual input ceiling. Whether the user has a MacBook Retina display or an external 4K monitor, the model receives an image sized appropriately for it.
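A minimal sketch of both levels, under stated assumptions: `to_pixels` and `fit_to_model` are illustrative names, and `max_side` stands in for the model's visual input ceiling, whose real value the source does not give.

```python
def to_pixels(rel_x: float, rel_y: float, width: int, height: int) -> tuple[int, int]:
    """Map a relative position in [0, 1] to a pixel on a width x height screen."""
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("relative coordinates must lie in [0, 1]")
    # Clamp so that 1.0 maps to the last valid pixel rather than one past it.
    x = min(int(rel_x * width), width - 1)
    y = min(int(rel_y * height), height - 1)
    return x, y


def fit_to_model(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale a screenshot size down so its longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the assumed input ceiling
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The same relative position (0.5, 0.5) resolves to pixel (1920, 1080) on a 3840x2160 display and to (640, 360) on a 1280x720 display, so the model's output does not depend on the device it happens to be driving.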
How to ensure multi-step task reliability.
Real tasks often require dozens or even hundreds of consecutive operations. Any single error—coordinate recognition deviation, window not focused in time, unexpected pop-up blocking—can cause all subsequent steps to fail.
MiniMax's approach is to take a screenshot automatically after each operation and have the model confirm whether the last step actually succeeded. If it did, the agent advances to the next step. If not, it enters a diagnostic flow: identify the reason for the failure and try an alternative approach (if the mouse could not reach a button, try a keyboard shortcut instead). If retries are exhausted, the agent proactively tells the user which step it is stuck on rather than continuing blindly. Occasional small problems in multi-step tasks are handled on the spot, so they do not accumulate into total failure.
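The control flow described here can be sketched as a loop that verifies each step before moving on. This is an illustration only: `Step`, `run_steps`, and the `verify` callback (a stand-in for "take a screenshot and ask the model whether the step succeeded") are hypothetical names, not MiniMax's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    # Candidate ways to perform the step, tried in order (e.g. click, then shortcut).
    attempts: List[Callable[[], None]]
    # Hypothetical check standing in for screenshot-based verification.
    verify: Callable[[], bool]


def run_steps(steps: List[Step]) -> str:
    """Run steps in order; stop and report the first step that cannot be verified."""
    for step in steps:
        succeeded = False
        for attempt in step.attempts:
            attempt()
            if step.verify():
                succeeded = True
                break  # verified, move on to the next step
        if not succeeded:
            # Retries exhausted: surface the failure instead of continuing blindly.
            return f"stuck on step: {step.name}"
    return "done"
```

Listing the alternatives per step keeps the fallback explicit: a failed mouse click is followed by the keyboard shortcut, and only when every alternative fails verification does the run stop and name the step it is stuck on.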
How to manage remote permissions.
When you're not at your computer and send instructions remotely via IM, permission boundaries must be clear. Deleting files, modifying system settings—if these execute without confirmation, the convenience of remote operation becomes a risk.
MiniMax puts permission management in IM too. When the agent prepares to execute high-risk operations, it pauses and pushes an interactive card to Feishu or Slack with the specific action for one-click authorization or rejection. Platforms like WeChat that don't support interactive components handle authorization via text commands. During execution, users can also send instructions to halt the agent at any time. Every critical action goes through the user's own confirmation.
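A minimal sketch of such a permission gate, assuming a fixed set of high-risk action kinds. The action names, platform identifiers, and the `ask_user` and `run` callbacks are hypothetical; the source does not describe how MiniMax classifies risk or delivers the confirmation.

```python
from typing import Callable

# Hypothetical set of action kinds treated as high-risk.
HIGH_RISK = {"delete_file", "modify_system_setting"}

# Platforms assumed to support interactive cards, per the text above.
CARD_PLATFORMS = {"feishu", "slack"}


def confirmation_channel(platform: str) -> str:
    """Pick how to ask the user: an interactive card or a text command."""
    return "card" if platform.lower() in CARD_PLATFORMS else "text"


def execute(action: str, platform: str,
            ask_user: Callable[[str, str], bool],
            run: Callable[[str], None]) -> str:
    """Run low-risk actions directly; pause high-risk ones for approval."""
    if action in HIGH_RISK:
        channel = confirmation_channel(platform)
        if not ask_user(action, channel):
            return "rejected"  # the user declined, so nothing is executed
    run(action)
    return "executed"
```

Low-risk actions such as taking a screenshot pass straight through, while a file deletion requested over WeChat waits for a text-command approval and is dropped if the user rejects it.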
IM as Entry Point, Not a Feature
The essence of this update isn't releasing a new feature—it's redesigning the entry point for agents operating computers.
CLI covers the part of the work that is reachable through APIs. Computer Use takes the part that requires a graphical interface. IM serves as the unified instruction entry point and invokes either one on demand. Pocket delivers this capability: you can summon the agent on your computer from Feishu, Slack, or WeChat.
The scope of work an agent can reach now extends to the user's actual desktop.
But MiniMax itself acknowledges that Pocket and Computer Use remain early-stage capabilities. Recognition accuracy on complex interfaces, stability on long tasks, and generalization to new software all still need substantial engineering work.
This road is just beginning.
Sources:
- MiniMax Agent Update "This time we redesigned how agents operate computers": https://mp.weixin.qq.com/s?__biz=MzE5MTA3NzcxMQ==&mid=2247488360&idx=1&sn=9358b0c0a0750975e8f4c001b8f1e724