Running DeepSeek V4-Flash Locally: The Hardware Reality Check

MIT-licensed and free to download, but all 284 billion parameters still need somewhere to live. Where the local V4-Flash line actually sits in 2026.

Updated on

Last updated: June 2026

Key Takeaways

  • V4-Flash activates only 13 billion parameters per token, but all 284 billion must sit in memory. The smallest working quantized build is 81GB.
  • No stable release of llama.cpp, Ollama, or LM Studio can load the V4 architecture yet; every consumer run today depends on experimental community forks.
  • The realistic entry point is 96GB of VRAM or 128GB of unified memory. Most builders should run smaller models for now and watch for the two signals below.

On paper, DeepSeek V4-Flash is the release local AI builders have spent two years asking for: full open weights under a standard MIT license, near-frontier benchmark results, and a download price of zero. One community quantization repository on Hugging Face logged roughly 99,000 downloads in a month. The question is what happens after the download finishes.

The short version: "Flash" describes the model's inference cost, not its size, and the software needed to run it on consumer hardware exists only in experimental forks as of June 2026. We covered the release itself in our launch-day coverage; this is the follow-up for anyone deciding whether their own machine can run it, with the honest numbers.

What V4-Flash Actually Is, and the 13B Misconception

DeepSeek released the V4 family on April 23, 2026 as a two-model preview: V4-Pro at 1.6 trillion total parameters and V4-Flash at 284 billion. Flash is the one that matters at home; its specification sheet explains both the excitement and the problem.

Specification DeepSeek V4-Flash
Total parameters 284 billion
Active parameters per token 13 billion
Architecture Mixture-of-Experts with hybrid compressed sparse attention (CSA + HCA)
Context window 1 million tokens
Native precision FP4 weights, FP8 KV cache
Reasoning modes Non-think, Think High, Think Max
License MIT, unmodified
Launch inference support vLLM and SGLang

The misconception to kill immediately: "13 billion active" does not mean V4-Flash behaves like a 13B model on your hardware. Active parameters set the compute per token, which is why DeepSeek can serve it cheaply. Total parameters set the memory bill: the router can call any expert on any token, so every expert has to be resident. A 13B dense model needs about 8GB at 4-bit. V4-Flash at comparable precision needs roughly twenty times that.

The capability itself is genuine. DeepSeek reports 79.0 percent on SWE-bench Verified, and Artificial Analysis placed Flash within a few points of the far larger V4-Pro on its intelligence index. The MIT license is the real, unmodified text: the legal side of self-hosting is fully solved. The physics side is not.

The Memory Math Nobody Puts in the Headline

Quantization is the only lever that shrinks the memory bill, and the community has pulled it hard since April. These are the working GGUF builds available as of mid-June, with actual file sizes, not theoretical minimums:

Quantization File Size Memory Tier It Targets
Q4_K_M 172 GB 192GB-plus unified memory or multi-GPU servers
FP4 + FP8 native mix ~146 GB 192GB unified memory with headroom
Q3_K_M 136 GB 192GB unified memory
Q2_K 103 GB 128GB unified memory, with no room to spare
IQ2_XS-XL 81 GB 96GB dual-GPU workstations; 128GB unified with context room

Sizes from community Hugging Face repositories, June 2026. Quantizations change weekly; confirm file size before committing to a 100GB-plus download.

The table needs two footnotes. The KV cache first: the one-million-token context window is real but priced for datacenters. At home, plan on 8K to 32K tokens of context and budget 10 to 20 percent on top of the file size for cache, runtime, and operating system, an overhead the FP8 cache keeps manageable but never erases.

Then the quality floor. Below roughly 3 bits per weight, output quality degrades noticeably. The builds that make the 128GB tier possible dodge this with mixed precision: routed experts at 2-bit, with attention, shared experts, and output layers kept at 8-bit. That compromise is also why the most interesting work is happening in one specific fork.

The Software Gap: Where Support Actually Stands

V4 launched with day-one support in vLLM and SGLang, both datacenter frameworks built for multi-GPU servers; vLLM 0.21 has since stabilized V4 on Blackwell hardware. The home stack is a different story. llama.cpp, and everything built on it, including Ollama and LM Studio, still cannot load the V4 architecture in any stable release.

Support exists, but in a work-in-progress branch and a handful of forks that add what upstream lacks: the hybrid attention decode path, the expert-routing indexer, and FP8 cache handling. The quantized files are pinned to those forks; download a V4-Flash GGUF and the model card tells you that stock llama.cpp will refuse to load it.

The most notable fork comes from Salvatore Sanfilippo, the creator of Redis, who published an experimental llama.cpp build targeting 128GB MacBooks with the 2-bit expert mix above, then spun the work into ds4, a standalone Metal runtime scoped to one model. He describes the chat quality as having frontier-model character while cautioning it is not extensively tested. When running a model requires a personal runtime from one of open source's most respected engineers, "experimental" is the precise word.

The fork benchmarks also broke standard quantization intuition: V4-Flash's decode speed is limited by compute in its routing and indexing kernels, not memory bandwidth. One quant maintainer measured the 8-bit build decoding faster than the 4-bit build, because its dequantization path is simpler. The smallest file is not automatically the fastest.

The setup today, roughly:

git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
./build/bin/llama-server \
  --model DeepSeek-V4-Flash-IQ2_XS-XL-00001-of-00002.gguf \
  --ctx-size 8192 --n-gpu-layers 999 --flash-attn on

Read that as a temperature reading, not a tutorial. There is no Ollama pull, no LM Studio listing, no one-click anything: you compile a fork, match your quantized file to it, and troubleshoot from there. AMD's Strix Halo machines add a ROCm workaround on top, to route around a crash on the chip's GPU. The path works. It is not consumer-ready.

Tier by Tier: Can Your Machine Run It?

Your Hardware Best Available Build Reported Speed Verdict
32GB mini PC or laptop None fits Not applicable No. Run the small-MoE class instead.
Gaming GPU, 24 to 32GB VRAM Fork build with experts offloaded to system RAM Single-digit tokens per second Not practical
96GB workstation GPU, single or dual-card IQ2_XS-XL at 81GB, fully on GPU 8 to 12 tokens per second Runs, on an expert-mode build
128GB unified memory (M4 Max, Ryzen AI Max+ 395, DGX Spark) 2-bit expert mixes via forks Usable chat speeds; early reports vary The true borderline
192 to 256GB Mac Studio (M3/M4 Ultra) Q3_K_M to Q4_K_M, or MLX 4-bit ~25 tokens per second The one comfortable tier; see availability note

Speeds are early community reports on experimental builds, June 2026. Ballparks, not benchmarks.

Start with the tier most readers are in. Even an RTX 5090's 32GB of VRAM is 49GB short of the smallest working build. Expert-offload flags can park routed experts in system RAM, and it does run, at speeds nobody would use twice, on a box that also needs 128GB of system RAM. A 24GB card like the RTX 3090 remains an excellent local AI GPU, as our local AI hardware guide covers in depth; it is simply built for other models.

The 96GB tier is where V4-Flash first runs entirely on GPU. The 81GB IQ2_XS-XL build genuinely fits, on one RTX PRO 6000 or two 48GB cards, and the 8 to 12 tokens per second figure comes from testing on that exact hardware. That is workstation silicon at workstation prices, running the lowest viable quality tier slower than the hosted API serves it. Possible is the right word. Sensible is a stretch.

The 128GB unified tier is where the interesting fight is happening. Apple Silicon has the Sanfilippo fork and an MLX path. AMD's Strix Halo boxes work with the ROCm caveat. NVIDIA's DGX Spark needed its own patch within days. All three routes share the same profile: 2-bit expert quality, experimental software, and results that impress because they should not be possible.

The 192GB-plus tier is the only one that resembles normal local AI. Q4-class builds fit with headroom, and early MLX testing on a 192GB M3 Ultra reported around 25 tokens per second, comfortable for solo work. The 2026 problem is buying one: Apple pulled its highest-memory Mac Studio configurations from sale this spring as the shortage bit, and the M5 generation skipped WWDC entirely. The one comfortable tier is also, for now, the hard-to-buy tier.

The RAM Crisis Closed the Escape Hatch

The consolation answer for MoE models too big for VRAM has always been: leave attention on the GPU, push the experts into system RAM, accept slower output. The technique still works. The economics of 2026 do not. Tom's Hardware's RAM price index recorded the cheapest in-stock 32GB DDR5 kit at $374.97 on June 3, capacity that sold for $80 to $120 a year earlier, and 64GB kits now commonly list above $600.

At those prices, the 128GB of system memory an offload build wants costs more than a used 24GB GPU that runs 30B-class models in VRAM, fast. The math that once made "just add RAM" the budget path now points the other way, and analysts are not forecasting relief this year. We covered the full picture, including the build strategies that still hold up, in our RAM shortage explainer.

What to Run Instead, and the Two Signals to Watch

If you came to local AI for privacy and control, you do not need 284 billion parameters to get it. The current generation of open models is unusually kind to modest hardware, as our open-source LLM and hardware guide lays out, and the right move below 96GB is to run what fits well rather than what barely loads.

On 16GB to 32GB machines, the 2026 small-MoE class, Gemma 4's 26B mixture among them, runs at conversational speed with no GPU; our mini PC guide for local AI maps the tiers. With a 24GB card, 30B-class models run entirely in VRAM and 70B-class models run quantized. The renewed RTX 3090 is still the best value per gigabyte of VRAM for that job.

Check Price on Amazon: NVIDIA RTX 3090 24GB (Renewed)

At 64GB of unified memory, you can hold the strongest of the small MoE generation with full working context and let the box double as a home server. Machines that ship with memory installed have also sidestepped the worst of the RAM market, their components bought on contracts signed before the squeeze.

Check Price on Amazon: MINISFORUM AI X1 Pro 370 (64GB)

The hosted route is fast and cheap. The tradeoff is the one this site exists to flag: your prompts leave your network and are processed under the provider's privacy policy and jurisdiction, which for DeepSeek's own API means servers in China. For some workloads that is acceptable; for the documents that made you want local AI in the first place, it is the entire question. Open weights mean the local option stays permanently yours to exercise. Today's hardware just means most people cannot exercise it yet.

Two events will mark the turn. First, the V4 architecture merging into mainline llama.cpp; Ollama and LM Studio support typically follows within weeks. Second, major quantization publishers shipping turnkey, calibrated builds instead of fork-pinned conversions. Either one moves V4-Flash from expert project to weekend project. Both are worth waiting for before spending money.

What V4-Flash Signals Even If You Never Run It

Step back from the install friction and the release still matters, for one number: 13 billion active parameters delivering near-frontier results. Two years ago that quality required dense models no consumer machine could load at any quantization. V4-Flash is the strongest evidence yet that the active-parameter floor keeps falling, and that the gap between "frontier-adjacent" and "fits in a 32GB box" is now an engineering problem, not a research problem.

The pattern is already visible one size class down, where this spring's small-model releases shipped with day-one runtime support on hardware people already own. Expect the next Flash-class model, or the one after it, to land runnable on launch day. Whoever controls the infrastructure controls the experience, and an MIT-licensed, frontier-adjacent model is the door staying open, even in a year when the hardware to walk through it got harder to buy.

Frequently Asked Questions

Can an RTX 4090 or RTX 5090 run DeepSeek V4-Flash?

Not in any practical sense. The smallest working build is 81GB, far beyond 24 or 32GB of VRAM. Forks can offload experts to system RAM, but reported speeds are single-digit tokens per second, and that much RAM is expensive in 2026. These cards are better spent on 30B and 70B-class models.

Does Ollama or LM Studio support V4-Flash?

Not in any stable release as of June 2026. Both depend on llama.cpp, which has not merged the V4 architecture; support lives in a work-in-progress branch and community forks. When the merge lands upstream, Ollama and LM Studio should follow within weeks, making that merge the best signal to watch.

How much memory does the 4-bit version actually need?

The Q4_K_M file is 172GB, and the native FP4 mixed build is about 146GB. Add 10 to 20 percent for cache, runtime, and operating system, and you are realistically in 192GB-plus territory for a comfortable 4-bit run.

Is the MIT license really safe for commercial use?

Yes. V4-Flash ships under the standard, unmodified MIT license, which permits commercial use, modification, and fine-tuning. Not every 2026 open-weight release works this way: MiniMax M2.7, for example, uses a modified MIT license with commercial-use conditions that deserve a careful read before any deployment.

Will a 128GB MacBook Pro or Mac Studio run it?

A qualified yes; this is the genuine borderline. The experimental fork from Redis creator Salvatore Sanfilippo targets exactly this tier with 2-bit expert quantization, and early reports describe usable chat quality. Expect experimental software, quality tradeoffs, and a tinkering project rather than an appliance.

Should I wait for better support or just use the API?

If your goal is privacy and data sovereignty, waiting costs you nothing: run a model that fits your hardware today, and revisit V4-Flash when llama.cpp support merges upstream and turnkey quantizations ship. The hosted API is fast and cheap, but your data is processed on the provider's servers under its jurisdiction, which is what local AI exists to avoid.

USA-Based Modem & Router Technical Support Expert

Our entirely USA-based team of technicians each have over a decade of experience in assisting with installing modems and routers. We are so excited that you chose us to help you stop paying equipment rental fees to the mega-corporations that supply us with internet service.

Updated on

Leave a comment

Please note, comments need to be approved before they are published.