Claude Opus 4.8 Is Here, and the Frontier Just Moved Again
Last updated: May 2026 · By the ModemGuides Editorial Team
Anthropic shipped Claude Opus 4.8 on May 28, 2026, just 41 days after Opus 4.7 and at the same price. The pitch is sharper judgment, more honesty about its own work, and the ability to run unattended for longer than its predecessors. The benchmarks back up most of that.
But before you switch a workflow over to it, it is worth asking the question this site always asks: who controls the thing you are about to depend on? Opus 4.8 is an excellent model. It is also rented, closed, and cloud-only. Here is what actually changed, how it stacks up against GPT-5.5 and Gemini 3.1 Pro, and where it sits next to the open-weight models you can run on your own hardware.
Key takeaways
- Opus 4.8 leads most of Anthropic's published benchmarks at the same price as Opus 4.7, but GPT-5.5 still edges it on agentic terminal coding.
- The headline improvement is honesty: Anthropic reports the model is roughly four times less likely than 4.7 to let flaws in its own code pass without flagging them, which matters a great deal if you run AI agents unattended.
- It remains rented, closed-weight intelligence. For sovereignty you trade some capability for control by running open-weight models locally.
What Actually Changed in Opus 4.8
Sharper judgment and "honesty"
Every Anthropic model is trained to avoid claims it cannot support. The recurring failure across the industry is that models jump to conclusions, confidently reporting progress on thin evidence. Anthropic says Opus 4.8 is around four times less likely than Opus 4.7 to let flaws in code it wrote slip by unremarked, and that it more readily flags its own uncertainty rather than papering over it.
This is the most security-relevant change in the release. The danger with an autonomous agent is not the mistake you catch, it is the mistake it makes quietly while you are not watching, then builds on. A model that surfaces its own doubt and stops to check is a safer thing to hand a long task to. Anthropic's alignment team also reports the model reaches new highs on prosocial measures such as supporting user autonomy and acting in the user's interest, with lower rates of misaligned behavior than 4.7.
Longer unattended runs and Dynamic Workflows
Claude Code gains a research-preview feature called Dynamic Workflows: the model plans a large task, spins up hundreds of parallel subagents in a single session, then verifies its own outputs before finishing. Anthropic says this can drive codebase-scale migrations across hundreds of thousands of lines of code, using an existing test suite as the bar to clear. Claude users on the web and in Cowork also get effort controls, a setting that dials how hard the model works on a given response.
Pricing and access
Opus 4.8 costs $5 per million input tokens and $25 per million output tokens, identical to Opus 4.7. Fast mode runs at roughly 2.5 times the speed and is about three times cheaper than fast mode was on previous models. Developers call it through the Claude API as claude-opus-4-8, and it is available on claude.ai, in Cowork, and through partners such as GitHub Copilot. The "same price" framing is real, but note that effort controls are also a meter: lower effort burns fewer tokens, higher effort costs more. You are tuning a dial that Anthropic, not you, ultimately sets the units on.
Claude Opus 4.8 Benchmarks vs GPT-5.5 and Gemini 3.1 Pro
The figures below are Anthropic's own, drawn from its launch announcement and system card. Because the model launched the same day this was written, no independent third-party evaluations were available yet. Treat these as vendor-reported results until labs like Artificial Analysis or Vals AI publish their own.
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Agentic coding (SWE-Bench Pro) | 69.2% | 64.3% | 58.6% | 54.2% |
| Agentic terminal coding (Terminal-Bench 2.1) | 74.6% | 66.1% | 78.2% | 70.3% |
| Reasoning, no tools (Humanity's Last Exam) | 49.8% | 46.9% | 41.4% | 44.4% |
| Reasoning, with tools (Humanity's Last Exam) | 57.9% | 54.7% | 52.2% | 51.4% |
| Agentic computer use (OSWorld-Verified) | 83.4% | 82.8% | 78.7% | 76.2% |
| Knowledge work, GDPval-AA (score, higher is better) | 1,890 | 1,753 | 1,769 | 1,314 |
| Agentic financial analysis (Finance Agent v2) | 53.9% | 51.5% | 51.8% | 43.0% |
The honest read: Opus 4.8 tops the table on six of seven measures. The one it does not win is agentic terminal coding, where GPT-5.5 leads 78.2% to 74.6%. Its widest margins are in knowledge work (the GDPval-AA score, where higher is better) and financial analysis. On computer use it improves only slightly over 4.7 on this benchmark, though Anthropic separately cites 84% on the Online-Mind2Web browser-agent test as a high-water mark for the model.
The Catch: You Are Renting Frontier Intelligence
Closed weights, cloud-only, account-dependent
Opus 4.8 lives entirely on Anthropic's servers. You cannot download the weights, you cannot run it offline, and you cannot inspect how it works. Your access depends on an account, terms you did not write, and a price Anthropic can change. That is the core trade of every closed frontier model, and it is the through-line of everything we cover here: whoever controls the infrastructure controls the experience.
What that means in practice
The capability is real and, for the hardest agentic work, genuinely ahead of the open field. But the dependency is real too. If the price moves, your cost moves. If the terms change, your workflow changes. If your account is restricted, your "intelligence" goes dark with it. None of this is a reason to avoid hosted models, plenty of work belongs in the cloud, but it is a reason to know exactly which parts of your stack you actually own.
What You Can Run Locally Instead
Be honest about the gap first. DeepSeek itself names a three-to-six-month lag behind the closed frontier on the hardest cross-domain reasoning, and a model like Opus 4.8 still leads the toughest agentic coding tasks. What has changed is that open weights have closed most of the everyday gap, and they win outright on the one thing a hosted model can never offer: control. For the full picture, see our guide to the best open-source LLMs and the hardware they need and our breakdown of the DeepSeek V4 release, currently the largest open-weight model ever shipped.
The practical ladder by hardware looks like this. With 16 GB of RAM, Gemma 4 26B (Apache 2.0) is the safe default. With 32 GB of unified memory or a 24 GB GPU, Qwen3.6-35B-A3B raises the ceiling on coding and long-context work. With a 128 GB Mac Studio, DeepSeek V4-Flash gets you frontier-class open weights running at home.
ModemGuides may earn a commission from qualifying purchases made through the links below, at no extra cost to you. We only recommend hardware we would build with ourselves.
| Capability | Claude Opus 4.8 | Open-weight models (run locally) |
|---|---|---|
| Download the model | No | Yes |
| Run fully offline | No | Yes |
| Inspect / audit the model | No (closed) | Yes (open weights) |
| Cost model | Per token ($5 / $25 per million) | One-time hardware, then free to run |
| Hardware needed | None (cloud) | 16 GB to 128 GB+ RAM or VRAM |
| Top-tier capability | Frontier (leads most benchmarks) | Strong, ~3 to 6 months behind on the hardest tasks |
The hardware reality is simpler than it looks. The single biggest lever for local AI is memory. A used 24 GB GPU is still the best value for serious local inference, the most flexible mid-range path is a 32 GB mini PC, and Apple's unified-memory machines scale highest of all for large open models. Our mini PC guide for local AI walks through the full setup with Ollama.
For the 24 GB VRAM path, an NVIDIA RTX 3090 remains the value leader:
For a 32 GB mini PC you can leave running quietly, the MINISFORUM UM880 Plus is a solid pick:
For the largest open models, Apple's high-memory desktops scale furthest. You can see current Mac Studio configurations at Apple.
Who should rent, and who should own
Rent Opus 4.8 if you need maximum capability today, your work is heavy agentic coding or analysis, you have no hardware budget, and your data is fine living in the cloud. Own a local stack if you handle private or regulated data, you need offline operation, you want to avoid a per-token bill at scale, or you simply want to inspect and control the model you depend on. For most readers building something durable, the answer is both: a hosted model for the hardest one-off jobs, and a local model for everything you would rather not send away.
Should You Upgrade to Opus 4.8?
If you use Claude Code or build agents that run unattended, the judgment and honesty gains are the real reason to move, not the headline benchmark deltas. Fewer silent failures is worth more than a few points on a leaderboard. If cost is your constraint, the cheaper fast mode and the new effort controls give you levers to pull. And if Opus 4.7 left you underwhelmed, this is the release that addresses it.
One note on timing. The 41-day gap from 4.7 is a fast cadence, and Anthropic has teased its higher-end Mythos-class models for broader release "in the coming weeks." That is a teaser, not a date. Opus 4.8 is shipping today, so there is no reason to stall production work waiting on a model that has no firm arrival.
Frequently Asked Questions
How much does Claude Opus 4.8 cost, and is it free?
Through the API, Opus 4.8 costs $5 per million input tokens and $25 per million output tokens, the same as Opus 4.7. Limited free and subscription access is available through the Claude app and claude.ai, but heavy or programmatic use is billed per token.
Is Claude Opus 4.8 better than GPT-5.5?
On Anthropic's own benchmarks, Opus 4.8 leads on six of seven measures, including agentic coding, reasoning, knowledge work, and financial analysis. GPT-5.5 wins on agentic terminal coding (Terminal-Bench 2.1). "Better" depends on your task, and these are vendor-reported figures pending independent testing.
Can I run Claude Opus 4.8 locally or offline?
No. Opus 4.8 is a closed model that runs only on Anthropic's servers, so it cannot be downloaded or run offline. If local or offline operation matters to you, use open-weight models such as Gemma 4, Qwen3.6, or DeepSeek V4-Flash, which you can run on your own hardware.
How do I access Opus 4.8, and what is the API model name?
Developers use the model string claude-opus-4-8 through the Claude API. It is also available on claude.ai, in Claude Cowork, and through partners including GitHub Copilot.
What is Anthropic Mythos?
Mythos is Anthropic's higher-end model class, more capable than the publicly available Opus line and referenced internally as its strongest model. Anthropic has said Mythos-class models are slated for broader release in the coming weeks, but has not given a firm date.
What do the new effort controls do, and do they save money?
Effort controls on claude.ai and Cowork let you set how much reasoning effort Claude puts into a response. Higher effort means deeper, slower work; lower effort is faster and uses fewer tokens, which can reduce cost on tasks that do not need maximum depth.
What changed about safety and alignment in Opus 4.8?
Anthropic reports that Opus 4.8 is roughly four times less likely than 4.7 to let flaws in its own code pass unremarked, flags its uncertainty more readily, and shows lower rates of misaligned behavior. Its alignment team says the model reaches new highs on prosocial traits like supporting user autonomy.

