Last updated: June 2026
Key Takeaways
- Sakana Fugu is an orchestration system, not a model: it routes your request across a hidden pool of other companies' models and returns one answer. Fugu Ultra posts frontier-competitive numbers, but every figure is Sakana's own and none has been independently verified.
- Against the models you can actually buy, Fugu Ultra leads on several coding and reasoning tests. Against Claude Fable 5, the model Sakana says it matches, Fable 5 still wins three of the four head-to-head benchmarks Sakana published, and Fable 5 is export-restricted and not in Fugu's pool, so "matching" it is a claim about a substitute, not the real model.
- Fugu is a genuine capability and resilience product for some work, but it is not privacy or sovereignty: you cannot self-host it, cannot see which models touch your prompt, and cannot opt out on Ultra. If you want capability nobody can revoke or inspect, that is open-weight models on hardware you own.
When the US pulled Claude Fable 5 and Mythos 5 offline by export-control order on June 12, the most capable models most people could reach vanished overnight. Days later, Tokyo's Sakana AI launched Fugu with a pointed pitch: frontier performance that matches Fable and Mythos, through one API, with no export-control exposure. For anyone who lost access, the obvious question is whether that is true.
The short version: Fugu is real, genuinely strong on specific work, and does beat every frontier model you can currently buy on several benchmarks. But on Sakana's own numbers it generally trails Fable 5 itself, and "matching" a model it cannot include and you cannot access is a narrower claim than the headlines suggest. Here is the honest read on the benchmarks, the catch inside them, and whether Fugu actually rivals the frontier.
What Sakana says Fugu matches
Sakana's exact claim is that Fugu Ultra stands "shoulder-to-shoulder with Fable 5 and Mythos Preview," delivering "frontier capability without the risk of export controls." It is worth noting Sakana is careful here: it says shoulder-to-shoulder, not "beats." The louder secondhand coverage, framing Fugu as topping GPT-5.5, Gemini, and Opus across the board, runs ahead of the more restrained claim Sakana actually made about Fable.
What you are measuring also matters. Fugu is not a model in the usual sense. It is an orchestrator: a language model trained to take your request, hand pieces to a pool of other large language models, verify their work, and synthesize one answer, all behind an OpenAI-compatible API. It ships as Fugu and the higher-accuracy Fugu Ultra, and it is grounded in two ICLR 2026 papers, TRINITY and The Conductor. The capability on the leaderboard is coordination, not a single model's weights.
Fugu Ultra vs Claude Fable 5: the head-to-head
On the benchmarks where Sakana published both, Fable 5 still leads more often than not. Here is the direct comparison.
| Benchmark | Fugu Ultra | Claude Fable 5 | Leader |
|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 73.7 | 80.3 | Fable 5 |
| TerminalBench 2.1 | 82.1 | 88.0 | Fable 5 |
| LiveCodeBench | 93.2 | 89.8 | Fugu Ultra |
| Humanity's Last Exam | 50.0 | 53.3 | Fable 5 |
Source: Sakana AI's published benchmark chart, June 2026. Fugu Ultra figures are Sakana's own; Claude Fable 5 figures were originally provider-reported by Anthropic and carried into Sakana's chart. SWE-Bench Pro used mini-swe-agent scaffolding for Fugu. Cross-vendor methodology differs, so read these as directional. Fable 5 is currently unavailable to the public under the June 12 export-control order.
Fable 5 takes SWE-Bench Pro and TerminalBench 2.1, the two agentic-coding tests, plus Humanity's Last Exam. Fugu Ultra takes LiveCodeBench. So "shoulder-to-shoulder" is fair shorthand, the gaps are single digits, but it is not "as good as," and on the hardest software-engineering work the model Fugu names as its peer is measurably ahead. Both columns are vendor numbers gathered under different conditions, so even this comparison is directional rather than settled.
Fugu vs the frontier you can actually use
The more practical comparison for most people is against the models still available, because Fable 5 is gone. Here Fugu's case is stronger.
| Benchmark | Fugu | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|---|---|
| SWE-Bench Pro | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 | 80.2 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 |
| Humanity's Last Exam | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 |
| GPQA-Diamond | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 |
| MRCRv2 (long-context recall) | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 |
Source: Sakana AI Fugu benchmark report, June 2026. Fugu scores are Sakana's own; baseline scores are provider-reported. Treat as directional until independent evaluations land.
On Sakana's numbers, Fugu Ultra leads Opus 4.8, GPT-5.5, and Gemini 3.1 Pro on SWE-Bench Pro, TerminalBench 2.1, LiveCodeBench, and GPQA-Diamond, and essentially ties Opus 4.8 on Humanity's Last Exam at 50.0 to 49.8. It is not a clean sweep: GPT-5.5 wins long-context recall on MRCRv2, 94.8 to 93.6. VentureBeat's read is the right frame: an orchestration system is bounded by the models in its pool, and the results are strong but not a silver bullet. Independent replication has not happened yet, so treat the numbers as a credible vendor self-report, not settled fact.
The catch the benchmarks hide
The headline claim has a structural asymmetry worth stating plainly. Fable 5 and Mythos are not in Fugu's pool, because they are export-restricted and not publicly accessible. Fugu cannot route to them, so "matching" them is a comparison against models it explicitly excludes. You do not get Fable 5 by using Fugu. You get Fugu's coordination of other models, measured against Fable 5's published scores.
That matters for the buying decision. If a model in Fugu's pool gets its own restriction or price change, your results shift with it. The orchestration does smooth over any single vendor disappearing, which is the real benefit, but it is bounded by whatever it can currently reach. For the model that actually vanished, see our Fable 5 and Mythos 5 breakdown, and for the other prominent compound answer to the same gap, our OpenRouter Fusion analysis.
What you are actually buying, and where it is strong
Away from the leaderboard, Fugu is a real product with real strengths. Early users report it surfacing far more issues in code review than single models do, running multi-hour autonomous research and paper-reproduction tasks, and holding a stable persona across long sessions where other models drift. For teams that want frontier-class output without betting everything on one provider staying available, the resilience argument is legitimate. Pricing, per Sakana's pricing page as of June 2026, runs from a 20-dollar monthly tier up to 200 dollars, with per-token billing for Fugu Ultra, roughly in line with the major model APIs.
The honest framing is that Fugu competes on coordination and resilience, not on owning a stronger base model. For workloads where you do not know in advance which model is best per subtask, that is a genuine convenience. If your task and your model are already fixed, calling that model's API directly is simpler and usually cheaper.
What the benchmarks do not tell you: access, data, and control
Benchmarks measure capability. They say nothing about who controls the system or where your data goes, and on those axes Fugu's pitch needs a second look. You cannot self-host it; it is a closed, cloud-only API with no published weights. You cannot see which models handled any given query, because Sakana states the routing is proprietary and hidden by design. On Fugu Ultra you cannot opt any provider out of the pool. Your usage trains Sakana's models unless you opt out, and the service is unavailable in the EU and EEA on GDPR grounds.
That is the gap between "no export controls" and "sovereignty." Routing around any single vendor disappearing is real resilience. It is not control. You have replaced one dependency with a dependency on Sakana plus an undisclosed, rotating set of other companies, and you have less visibility than before. The through-line of everything we cover is that whoever controls the infrastructure controls the experience, and a rented orchestrator of rented models does not put that control with you.
The configuration that does is the unglamorous one: open-weight models on hardware you own, where the weights cannot be revoked, the data never leaves your network, and you can read exactly what the system is. It gives up some frontier capability and asks more of you up front. Our best hardware for local AI guide and best local models by VRAM breakdown cover what each budget realistically runs, and the mini PC guide covers a compact, always-on box to run it on.
So does Sakana Fugu rival Fable 5?
Close, and genuinely impressive for an orchestration layer, but a notch behind Fable 5 on the hardest tests by Sakana's own numbers, ahead of every frontier model you can currently buy, and entirely unverified by anyone but Sakana so far. "Shoulder-to-shoulder" is fair. "As good as Fable 5" overstates it. And either way, Fugu is not the restricted model; it is a different stack measured against it.
If you want the best generally available coordination today and you value resilience against a vendor vanishing, Fugu is a serious option, with the privacy and disclosure tradeoffs above. If you handle sensitive or regulated data, or you want capability nobody can revoke or inspect, the orchestration model is the wrong tool and a local stack is the right one.
Frequently Asked Questions
Does Sakana Fugu really match Claude Fable 5?
Close, on Sakana's own benchmarks, but Fable 5 still leads three of the four head-to-head tests Sakana published: SWE-Bench Pro, TerminalBench 2.1, and Humanity's Last Exam. Fugu Ultra wins LiveCodeBench. Sakana claims "shoulder-to-shoulder," not "beats," and Fable 5 is not in Fugu's pool, so it is a comparison against a model Fugu cannot actually use.
Is Sakana Fugu better than Claude, GPT-5.5, or Gemini?
On Sakana's numbers, Fugu Ultra leads Opus 4.8, GPT-5.5, and Gemini 3.1 Pro on several coding and reasoning tests, and ties Opus 4.8 on Humanity's Last Exam. It is not a sweep: GPT-5.5 wins long-context recall on MRCRv2. All figures are vendor-reported, with no independent evaluation yet.
Are Sakana Fugu's benchmarks real, and can I trust them?
They are Sakana's own results with provider-reported baselines, and as of late June 2026 there is no third-party verification. The scores are credible but unconfirmed, so treat them as directional rather than settled.
Is Sakana Fugu open source?
No. Fugu is a closed, cloud-only commercial product. The research behind it, the TRINITY and Conductor papers, is published, but the orchestrator, the pool composition, and the training details are proprietary.
Can I run Sakana Fugu locally or self-host it?
No. Fugu runs only on Sakana's cloud and has no offline mode. The nearest local path is running open-weight models such as Gemma 4, Qwen, or DeepSeek through Ollama on your own hardware. Our local AI hardware guide covers what each budget realistically buys.
Is my data private with Sakana Fugu?
Only partly. A single prompt may be processed by more than one provider in Fugu's pool, you cannot see which, there is no provider opt-out on Fugu Ultra, and your usage trains Sakana's models unless you opt out. Fugu is also unavailable in the EU and EEA on GDPR grounds. For sensitive or regulated data, a local model that never sends your prompt off your network is the safer route.

