Last updated: June 2026
Key Takeaways
- The viral "3x faster than an RTX 5080" benchmark is a capacity result, not a speed win. The Ryzen AI Max+ 395 runs models a 16GB GPU physically cannot load, and it runs them slowly.
- The $1,499 box is the 64GB configuration. Running a 235-billion-parameter model needs the 128GB version, which costs around $2,200 as of mid-June 2026, not $1,499.
- It is a private, no-meter machine for bulk and chat-style work, not a drop-in replacement for a frontier subscription or a fast coding agent. Match the box to the job before you buy.
A lunchbox-sized PC that runs a 235-billion-parameter model for the price of a mid-range GPU is a genuinely interesting product. The thread that made it go viral is also wrong about the parts that decide whether you should buy one. Here is the honest version: what this machine runs, how fast, what it costs once you buy the right configuration, and whether it can replace the AI subscriptions you already pay for.
The benchmark that broke the room is a capacity test
The line everyone screenshotted, that the Ryzen AI Max+ 395 beat an RTX 5080 by more than 3x on DeepSeek R1, is a real number from a real test. It also tells you almost nothing about speed.
Call it the capacity-versus-speed swap: a benchmark wins on whether a model fits, then gets reported as if it won on speed. AMD's own materials frame the figure as "up to 3.05x," and it only appears once the model exceeds the RTX 5080's 16GB of memory.
The mechanism is simple. A full-size DeepSeek R1 does not fit in 16GB of VRAM. When a model overflows a GPU's memory, the extra layers spill into system RAM and crawl across the PCIe bus, far slower than the card's own memory. The 5080 did not lose because it is slow; it lost because it was reading weights through a straw. The AMD box won because all 128GB sits in one pool and nothing had to spill.
Run a model that fits inside 16GB and the result flips. The 5080 pulls far ahead, because token generation is governed by memory bandwidth, and that gap is not close.
| Machine | Memory for models | Memory bandwidth (peak) |
|---|---|---|
| Ryzen AI Max+ 395 (128GB) | Up to ~110GB (Linux) | ~256 GB/s |
| RTX 5080 | 16GB VRAM | ~960 GB/s |
| RTX 4090 | 24GB VRAM | ~1,008 GB/s |
| RTX 5090 | 32GB VRAM | ~1,792 GB/s |
| Mac Studio M3 Ultra | Up to 256GB unified | 819 GB/s |
Figures are theoretical peaks. Strix Halo's real-world measured bandwidth lands around 210–220 GB/s, roughly a quarter of an RTX 4090's rate. That sets the ceiling on tokens per second, and no software trick gets around it.
The $1,499 box is not the box in the photo
The thread says $1,499. The photo shows a box running a 235-billion-parameter model. Those are not the same box.
The most widely available Ryzen AI Max+ 395 mini PC is the GMKtec EVO-X2, and the $1,499 listing is the 64GB / 1TB configuration. A 235-billion-parameter model, even a Mixture-of-Experts one, does not fit in 64GB, and neither does a dense 70B at a sane quantization. The box that ran the demo is the 128GB version.
That one is not $1,499. Pricing moves week to week, but as of mid-June 2026 the 128GB EVO-X2 lands around $2,199 (the figure Tom's Hardware logged for the GMKtec-direct 128GB/2TB unit), with retail listings nearer $2,299 and occasional promotions around $1,999. Budget about $2,200.
AMD now sells its own first-party version, the Ryzen AI Halo developer PC, through Micro Center at $3,999, the identical Strix Halo silicon and 128GB plus a developer-program bundle, for roughly $1,800 more. NVIDIA's competing DGX Spark, also 128GB, was raised to $4,699 in late February 2026, citing memory supply. If your goal is the machine in the photo, your real number is about $2,200. Decide your largest target model first, then buy the configuration that holds it: 70B wants 96GB or more, a 235B MoE wants 128GB.
Getting the full memory is a Linux job
"On Linux you get about 110GB of usable graphics memory" is true. It does not happen by itself, it never reaches that level on Windows, and the steps have shifted since the early guides.
On a unified-memory APU there is no separate VRAM chip; the system carves a slice of the shared pool for the GPU. Two limits govern how large that pool can grow: a BIOS carve-out, which AMD calls Variable Graphics Memory and caps at 96GB, and, on Linux, the kernel's GTT limit, which lets the GPU address ordinary system memory. Windows tops out at the 96GB allocation. Linux, through the GTT path, reaches about 110GB and beyond, which is why serious users run these boxes on Ubuntu or Fedora.
The counterintuitive part: early guides told you to max out the BIOS carve-out. For Linux LLM use, the consensus is now the opposite. Set it to the minimum (512MB) and let the GTT limit hand the GPU the rest of the pool dynamically, with no measured speed penalty. You also want a recent kernel; 6.16.9 and later fixes a bug where ROCm saw only about 15.5GB despite a correct allocation.
One verification gotcha: on a unified APU, rocm-smi reports tiny VRAM by design. Check the GTT total instead:
cat /sys/class/drm/card*/device/mem_info_gtt_total # bytes; want >100GB
If it reads north of 100GB after reboot, the memory is there; if it shows only a few gigabytes, your BIOS or kernel line did not take. Fix that before you blame the model.
What it actually runs, and how fast
With about 110GB you can load almost anything. Bandwidth decides whether you will enjoy it. At roughly 256 GB/s, the rule is straightforward: Mixture-of-Experts models punch far above their weight, and dense models hit a wall proportional to their size.
The 235B model everyone screenshotted is Qwen3-235B-A22B, a Mixture-of-Experts design: 235 billion total parameters, but only about 22 billion active on any given token. The chip streams the active experts, not the whole model, which is the only reason a box this slow can touch a model this big. GMKtec's published figure for the 128GB EVO-X2 is about 11 tokens per second on it, roughly comfortable reading speed.
| Model | Type | Approx. tokens/sec |
|---|---|---|
| Qwen3-235B-A22B | MoE (~22B active) | ~11 |
| Llama 3.3 70B (Q4) | Dense | ~5 |
| Qwen3-Coder 30B / 30B-A3B | MoE | ~70–100 |
| 7B–13B models | Dense | ~30–45 |
Figures are approximate and depend on quantization, runtime, and context length. MoE rates reflect active parameters, not total. These are generation speeds; prefill on long prompts is slower.
Dense models tell the opposite story. A 70B model at Q4 is about 40GB of weights streamed on every token, landing it in the single digits, around 5 tokens per second. Smaller 7B to 13B models run at 30 to 45, fast enough to feel instant. The 30B-class MoE models, such as Qwen3-30B-A3B and Qwen3-Coder 30B, have improved sharply on current llama.cpp Vulkan and ROCm builds, with community benchmarks now reporting 70 to 100. For a fuller breakdown of which open-weight models are worth running and on what hardware, see our guide to the best open-source LLMs and the hardware they need.
Two caveats the demos skip. AMD's own headline figures, up to 61 tokens per second on Phi-3.5 and a claimed 1.7x better tokens-per-dollar than a DGX Spark, are vendor self-reported; independent community testing shows an RTX 4090 running an 8B model several times faster on models that fit its 24GB. And none of these numbers count prefill: feeding a long prompt before the first token appears is genuinely slow, taking seconds to minutes on a large context.
Can it replace your Claude Code subscription?
This is the question most prospective buyers actually have, and the honest answer is a qualified "partly."
Pointing Claude Code at a local model used to require a translation proxy. It is now native. On January 16, 2026, Ollama v0.14.0 shipped an Anthropic-compatible Messages API endpoint, the same protocol Claude Code uses to reach Anthropic's servers. LM Studio followed on January 30 with the same endpoint in version 0.4.1. The connection is two environment variables:
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
claude --model qwen3-coder
Nothing leaves the machine and nothing meters per request. The privacy case is real: most developers now use AI coding tools, and many are uneasy about sending proprietary code to the cloud. A local agent removes that exposure entirely.
Three honest sharp edges, though. First, pull a tool-capable model, Qwen3-Coder or a GLM-4.x build, not a stock instruct model, which looks fine in a chat box and silently fails the moment the agent tries to edit a file. Second, raise the context window; Claude Code's system prompt is large, so 32K tokens is the floor and 64K the practical setting. Third, and most important, speed. There is no prompt caching on these endpoints, so every turn reprocesses the full system prompt and conversation from scratch, and Ollama's compatibility layer currently ignores tool_choice, which can send the agent into the wrong tool and a loop.
At around 11 tokens per second on a large model, that adds up. A coding agent fires many model calls per task, each waiting on the last, so the result is fine for a conversation and painful for fast, iterative work. If you go local, keep the model terse: a short CLAUDE.md telling it to skip preamble, edit the fewest lines possible, and cap output length keeps the token cost from stalling you. The realistic framing is not "cancel your subscription." It is "move the bulk, private, no-deadline work here, and keep a frontier subscription for the part that needs it."
The money math, honestly
The viral pitch goes like this: a stack of AI subscriptions runs about $440 a month, $5,280 a year, and "the box pays for itself in nine months." Those figures are plausible for a heavy user. The payback number is where it quietly cheats.
If a roughly $2,200 box fully replaced $440 a month, payback would be about five months, not nine. The nine-month figure only works if you do not fully replace your stack, and you will not. A local open-weight model here is strong, but it is not a frontier closed model on the hardest reasoning, and you will feel that gap on your most demanding work. The realistic move offloads the high-volume, privacy-sensitive, no-deadline work to the box and keeps one frontier subscription for the 10% that needs it: a longer payback, and a box that then runs without a meter.
There is a second reason the cheap framing misleads, and it is the one most relevant in 2026: memory. The box costs about $2,200, not $1,499, partly because high-capacity memory is expensive right now. The same DRAM crunch raised NVIDIA's DGX Spark to $4,699 and pushed Apple to drop the 512GB option from the M3 Ultra entirely, capping it at 256GB. We walked through the full picture in our DeepSeek V4-Flash hardware reality check, and the same math applies here: when memory is this expensive, "just buy the big box" is a real decision, not an impulse.
Who should buy it, and who should not
This box wins on exactly one axis: running models too large for a consumer GPU, privately, with no per-token meter, at a speed you can tolerate. If that is your situation, it is the cheapest path to fast unified memory, and it does something no single consumer GPU can.
It is the wrong buy for two common cases. If the models you run fit in 24 to 32GB of VRAM, a used RTX 3090, 4090, or 5090 will run them several times faster for similar money, so buy the GPU. And if you need frontier-grade reasoning every day, no local setup matches it yet; keep paying for the subscription that does. Our best hardware for local AI guide and our mini PC roundup break down the GPU and lower-cost options if either of those is you.
Check Price on Amazon: RTX 3090 24GB (Renewed)
Before you spend $2,200, spend ten minutes. Install Ollama on a machine you already own, pull Qwen3-Coder, set those two environment variables, point Claude Code at localhost, and watch your real tokens per second crawl by. If 11 a second feels fine, buy the 128GB box, not the $1,499 one.
Frequently Asked Questions
Can the $1,499 EVO-X2 run a 235B model?
No. The $1,499 listing is the 64GB configuration, and a 235-billion-parameter model, even a Mixture-of-Experts one, will not fit. You need the 128GB version, which costs around $2,200.
How many tokens per second does the Ryzen AI Max+ 395 get?
It depends on the model. A 235B MoE model runs at about 11 tokens per second, a dense 70B at roughly 5, and 7B to 13B models at 30 to 45. The 30B-class MoE models reach 70 to 100 on current builds. Generation is limited by memory bandwidth, around 256 GB/s.
Is it faster than an RTX 5080 or 5090?
No, for any model that fits on the GPU. The viral "3x" result only happens when a model is too large for the GPU's 16GB and has to spill into system RAM. For models that fit, a 5080 or 5090 is several times faster because it has far more memory bandwidth.
Can I run Claude Code on it?
Yes. Since Ollama v0.14 (January 2026) and LM Studio 0.4.1, Claude Code can point at a local model with two environment variables and no proxy. It works, but at local speeds it suits chat-style and single-shot tasks better than fast, multi-call agent loops. Use a tool-capable model like Qwen3-Coder.
Strix Halo or a Mac Studio for local LLMs?
The current Mac to weigh is the M3 Ultra, which has about 819 GB/s of bandwidth, roughly three times this box, and up to 256GB of unified memory, but it costs more. The Strix Halo box wins on price per gigabyte of fast memory and on form factor. An M5 Ultra is expected around October 2026.
Do I need Linux to get the full memory?
To reach about 110GB of usable graphics memory, yes. Windows caps the GPU's pool at AMD's 96GB Variable Graphics Memory allocation, while Linux's GTT path lets the GPU address more of the shared memory, which is why most heavy users run Ubuntu or Fedora on these machines.

