Star History Monthly Apr 2026 | Computer Use

Adela

Last month we covered the skills ecosystem. This month: agents that don't just write code, but drive a screen.

"Computer use" is industry shorthand — coined by Anthropic in October 2024, adopted by OpenAI in January 2025 — for vision-based GUI control. A coding agent edits text inside a sandbox you built for it. A computer-use agent moves a cursor on the same desktop you use.

Seven projects, three roles: the model that perceives the screen, the harness (Anthropic's term) that loops perception into action, and the infrastructure that gives the agent a desktop to drive. Which role you bet on says a lot about where you think the value lives.

| Project | Role | The bet |
| --- | --- | --- |
| UI-TARS | Model | Native GUI model: pixels in, clicks out |
| Fara | Model | 7B model that runs on your laptop |
| self-operating-computer | Harness | The original model-agnostic loop |
| UI-TARS-desktop | Harness | Vertical integration: model + runtime, co-designed |
| Agent-S | Harness | Vision + retrieval: looks up how to use the app |
| Bytebot | Harness + Infra | Containerized desktop, not yours |
| cua | Infrastructure | Neutral substrate: bring your own model and harness |

You need all three roles, but not always three projects. Model and harness are non-negotiable; infrastructure can just be your real desktop. That's why this list looks lopsided — most teams bundle a sandbox into their harness (Bytebot) or skip it entirely.

The Model

Both projects ship open-weight vision-language models trained specifically for GUI control. They disagree on size.

UI-TARS

🔗 https://github.com/bytedance/UI-TARS

Star History Chart

ByteDance's UI-TARS is a native GUI agent model — pixels in, click coordinates out. No accessibility tree, no DOM parsing. Trained end-to-end on screen recordings, it handles desktop, web, and mobile from one checkpoint.
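
"Pixels in, clicks out" means, concretely, that the model returns a short text action which the harness parses into screen coordinates. A minimal sketch of that parsing step; the action grammar here is an assumption and varies across UI-TARS versions, so treat the regex as illustrative:

```python
import re

# Illustrative parser for a UI-TARS-style text action. Assumed response shape:
#   Thought: the Submit button is in the lower right
#   Action: click(start_box='(812,1034)')
# Real responses vary by model version; adapt the pattern to yours.
ACTION_RE = re.compile(r"Action:\s*(\w+)\(start_box='\((\d+),\s*(\d+)\)'\)")

def parse_action(response: str):
    """Return (action, x, y) parsed from the model's response, or None."""
    m = ACTION_RE.search(response)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

print(parse_action("Thought: found it\nAction: click(start_box='(812,1034)')"))
# ('click', 812, 1034)
```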

The kicker is strategic: ByteDance also shipped the harness (see UI-TARS-desktop below). Most labs drop weights and let the community figure out the runtime. ByteDance closed the loop itself.

  • Best for: teams who want a model purpose-built for screens, not a general VLM coaxed into clicking.

Fara

🔗 https://github.com/microsoft/fara

Star History Chart

Microsoft's Fara-7B takes the opposite bet: small. A 7B agentic model that runs on hardware you already have. The thesis — GUI control doesn't need a frontier model, it needs one trained on the right thing.

Once a category gets a "good enough at 7B" model, running it locally becomes realistic. That matters more than usual here: the agent is watching your screen, and shipping every frame to a cloud API is a privacy story nobody wants to tell.
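
If the checkpoint ships in a transformers-compatible format, local inference could look roughly like the sketch below. The model ID and pipeline support are assumptions, not confirmed by the repo; check the model card for the real invocation.

```python
from PIL import Image
from transformers import pipeline

# Assumptions, loudly: the model ID and its "image-text-to-text" pipeline
# compatibility are taken on faith here; verify both against the model card.
# A 7B checkpoint also wants a decent GPU or plenty of RAM.
pipe = pipeline("image-text-to-text", model="microsoft/Fara-7B", device_map="auto")

# The screenshot never leaves the machine; that's the whole point.
screenshot = Image.open("screenshot.png")
print(pipe(images=screenshot, text="Where do I click to open Settings?", max_new_tokens=64))
```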

  • Best for: anyone who wants computer-use to run locally — privacy-sensitive workflows, or just laptops without a constant API bill.

The Harness

Four projects play the harness role — the loop that takes a screenshot, asks a model what to do, dispatches the action, checks the result. They disagree on what to perceive, where to run, and who to trust.
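
The whole pattern fits in a dozen lines. A minimal sketch, using pyautogui for perception and dispatch and a hypothetical stub standing in for whichever model you wire in:

```python
import pyautogui  # pip install pyautogui; screenshot() and click() are real calls

def ask_model(screenshot, goal: str) -> dict:
    """Hypothetical stand-in for your VLM call (GPT-4o, Claude, UI-TARS, ...).
    Expected to return e.g. {"action": "click", "x": 640, "y": 400}
    or {"action": "done"}."""
    raise NotImplementedError

def run(goal: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()         # 1. perceive
        decision = ask_model(screenshot, goal)      # 2. ask the model
        if decision["action"] == "done":            # 4. check the result, stop
            return
        if decision["action"] == "click":           # 3. dispatch the action
            pyautogui.click(decision["x"], decision["y"])
    raise TimeoutError(f"no 'done' after {max_steps} steps")
```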

self-operating-computer

🔗 https://github.com/OthersideAI/self-operating-computer

Star History Chart

The original. Released November 2023, it was the first widely-used framework that let a multimodal model drive a real computer. The architecture is embarrassingly simple — screenshot, prompt, click — and that simplicity is the point. It proved the loop works.

It's also model-agnostic: GPT-4o, Claude, Gemini, LLaVA, Qwen-VL all plug in. That stance — the model is a swappable component — became the default philosophy for everyone who came after.

  • Best for: developers who want the canonical reference implementation to read, hack on, or fork.

UI-TARS-desktop

🔗 https://github.com/bytedance/UI-TARS-desktop

Star History Chart

The vertical-integration play, and the largest project in the category at nearly 30,000 stars. Native desktop app, MCP support, browser-use integration — the full-fat stack ByteDance built around its own model.

The bet: when the same team trains the model and ships the runtime, the harness exposes exactly the action space the model was trained on. Apple, basically, for computer-use.

  • Best for: teams who want the most complete out-of-box experience and don't mind picking a side in the model-vs-runtime debate.

Agent-S

🔗 https://github.com/simular-ai/Agent-S

Star History Chart

Simular AI's Agent-S is the academic entry — "an open agentic framework that uses computers like a human." It pairs vision with retrieval-augmented planning, so the agent isn't just looking at the screen, it's looking up how to use the app it's looking at.

This is the hybrid perception camp. Pure pixel agents work great until they hit a UI element with no visual affordance. Agent-S falls back on docs, manuals, and prior task experience — a more honest picture of how humans actually drive an unfamiliar app.
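
A toy version of the retrieval step makes the idea concrete. This is not Agent-S's actual pipeline (which uses learned embeddings and stored task experience); the sketch scores a small corpus of app notes with bag-of-words cosine and prepends the best hit to the planning prompt:

```python
from collections import Counter
import math

# Toy knowledge base standing in for docs, manuals, and prior task traces.
DOCS = {
    "gimp-export": "To export a PNG in GIMP use File > Export As, not File > Save.",
    "vlc-subtitles": "In VLC load subtitles via Subtitle > Add Subtitle File.",
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(task: str) -> str:
    """Return the doc most similar to the task under bag-of-words cosine."""
    q = Counter(task.lower().split())
    return max(DOCS.values(), key=lambda d: cosine(q, Counter(d.lower().split())))

task = "export my drawing as a png in gimp"
prompt = f"Relevant doc: {retrieve(task)}\nScreenshot: <attached>\nTask: {task}\nPlan the next GUI action."
print(prompt)
```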

  • Best for: researchers and teams chasing benchmark performance on long-horizon tasks where pure vision falls down.

Bytebot

🔗 https://github.com/bytebot-ai/bytebot

Star History Chart

Bytebot is the cleanest pitch among the harnesses: a self-hosted AI desktop agent that operates inside a containerized Linux desktop. Give it natural language, it does the work — in a sandbox, not on your machine.

The dirty secret of computer-use is that letting an agent loose on your real desktop is terrifying. One hallucinated keystroke and your in-progress work is gone. Bytebot's answer: keep the agent, just don't give it your laptop.
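
In practice that means handing tasks to a service instead of scripting your own mouse. The endpoint and payload below are hypothetical, for shape only; check Bytebot's docs for the real API.

```python
import requests

# Hypothetical endpoint, port, and payload -- consult Bytebot's API docs for
# the real routes. The point: the task lands in a containerized desktop, so a
# hallucinated keystroke hits the sandbox, not your files.
SANDBOX = "http://localhost:9990"  # assumed address of a self-hosted instance

resp = requests.post(
    f"{SANDBOX}/tasks",
    json={"description": "Download last month's invoice and rename it"},
    timeout=30,
)
print(resp.json())  # then poll or stream until the task finishes
```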

  • Best for: anyone who wants to run computer-use agents in production without the "what just happened to my files" problem.

The Infrastructure

One project here, but an important one: the unbundled alternative to ByteDance's vertical stack.

cua

🔗 https://github.com/trycua/cua

Star History Chart

Trycua's cua is open-source infrastructure — sandboxes, SDKs, and benchmarks for agents that control full desktops across macOS, Linux, and Windows. Not the agent. The box the agent runs in.

The bet is unbundling. If ByteDance wins by owning model and harness together, cua wins by being the neutral substrate everyone else runs on. BYO model, BYO harness — cua provides the reproducible, multi-OS environment. The AWS of computer-use, if you want the analogy.
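
The unbundling is easiest to see as an interface. The Protocol below is illustrative, not cua's actual SDK surface: it is the minimal contract a neutral substrate has to honor so that any model and any harness can sit on top.

```python
from typing import Protocol

class Sandbox(Protocol):
    """Illustrative substrate contract; not cua's real SDK surface.
    Anything that honors it can host the same harness."""
    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...

def step(box: Sandbox, ask_model, goal: str) -> bool:
    """One harness step against *any* conforming sandbox: the loop is the
    same whether the desktop is local, a container, or a cloud VM."""
    decision = ask_model(box.screenshot(), goal)
    if decision["action"] == "click":
        box.click(decision["x"], decision["y"])
        return True  # keep looping
    return False     # "done" or unrecognized: stop
```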

  • Best for: teams building computer-use products who want production-grade sandboxing without writing it themselves.

Closing Thoughts

Three bets sit on the table. ByteDance bets vertical integration wins — train the model, ship the runtime, co-design both. Trycua bets the opposite — own the substrate and let the rest commoditize. Fara is the wildcard: if the model is small enough to ship locally, it can carry its own minimal runtime and the harness shrinks to a footnote.

The technology works. The category is up for grabs. The open question now isn't whether AI can drive a screen; it's which role captures the value: the model, the harness, or the box it all runs in?