Last updated: April 2026
Key Takeaways
- DeepSeek V4-Pro (1.6 trillion parameters / 49 billion active) is now the largest open-weight LLM ever released, shipping under the MIT License alongside the smaller V4-Flash (284 billion / 13 billion active). Both support a 1 million token context window.
- Pricing is the headline: V4-Pro costs roughly one-seventh the output-token price of Claude Opus 4.7 and one-ninth that of GPT-5.5. V4-Flash undercuts every frontier-tier small model on the market, including OpenAI's GPT-5.4 Nano.
- V4-Pro is not a home-lab model. Even V4-Flash needs a 128 GB Apple Silicon machine or a multi-GPU rig with aggressive quantization. For most readers, V4 is an API story first and a self-hosting story second.

What DeepSeek Just Released
On April 23, 2026, DeepSeek released a two-model preview of its V4 series: a larger V4-Pro and a smaller, faster V4-Flash. Both are Mixture-of-Experts (MoE) architectures, both ship under the permissive MIT License on Hugging Face, and both support a 1 million token context window as the default. The release comes a year after DeepSeek's V3 and R1 models rattled global markets and reset the conversation around open-source AI.
The headline specs put V4-Pro in a category of its own. At 1.6 trillion total parameters, it overtakes Moonshot AI's Kimi K2.6 (1.1 trillion) and Z.ai's GLM-5.1 (754 billion) to become the largest open-weight model ever released, as TechCrunch confirmed on release day. V4-Flash is a smaller, cheaper sibling aimed at high-volume workloads where latency and cost matter more than maximum capability.
| Specification | DeepSeek V4-Pro | DeepSeek V4-Flash |
|---|---|---|
| Total parameters | 1.6 trillion | 284 billion |
| Active parameters (per token) | 49 billion | 13 billion |
| Context window | 1 million tokens | 1 million tokens |
| Pre-training tokens | 33 trillion | 32 trillion |
| On-disk size (mixed precision) | ~865 GB | ~160 GB |
| License | MIT | MIT |
| Reasoning modes | Non-Thinking, Thinking, Max | Non-Thinking, Thinking, Max |
| Modalities | Text only | Text only |
One housekeeping note for anyone with existing DeepSeek integrations: the old deepseek-chat and deepseek-reasoner model names are being retired on July 24, 2026 at 15:59 UTC. Requests that use those model strings after the cutoff will fail. DeepSeek is currently routing them to V4-Flash in Non-Thinking and Thinking modes respectively, but anyone still pointing production code at the old names has roughly three months to migrate.
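The retirement date can be encoded as a small guard in client code. This is a minimal sketch that assumes only what the note above states (the cutoff time and the current routing of both legacy names to V4-Flash). The names `LEGACY_MODELS`, `replacement_for`, and `is_retired` are hypothetical helpers, not part of any DeepSeek SDK.

```python
from datetime import datetime, timezone

# Retirement cutoff for the legacy model strings, as stated in this article.
LEGACY_CUTOFF = datetime(2026, 7, 24, 15, 59, tzinfo=timezone.utc)

# Hypothetical lookup: legacy name -> (V4 model, reasoning mode) it is currently
# routed to, per the article. How a mode is selected in a request is not shown.
LEGACY_MODELS = {
    "deepseek-chat": ("deepseek-v4-flash", "Non-Thinking"),
    "deepseek-reasoner": ("deepseek-v4-flash", "Thinking"),
}


def replacement_for(model: str) -> str:
    """Return the V4 model string to use in place of a legacy name."""
    if model in LEGACY_MODELS:
        return LEGACY_MODELS[model][0]
    return model  # not a legacy name; leave unchanged


def is_retired(model: str, now: datetime) -> bool:
    """True if `model` is a legacy name and `now` (timezone-aware) is past the cutoff."""
    return model in LEGACY_MODELS and now > LEGACY_CUTOFF
```

A check like `is_retired(model, datetime.now(timezone.utc))` at startup would surface stale configuration before requests begin to fail.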
Why V4 Is Architecturally Interesting (and Why That Matters for Cost)
The most consequential part of this release is not the parameter count. It is how much compute and memory DeepSeek squeezed out of the long-context attention mechanism. In DeepSeek's own published numbers, at a 1 million token context, V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache memory compared to V3.2. V4-Flash pushes that even further, to roughly 10% of the FLOPs and 7% of the KV cache. That is the engineering story underneath the pricing story.
Four architectural choices drive those gains:
- Hybrid attention (CSA + HCA). V4 replaces standard full attention with a combination of Compressed Sparse Attention and Heavily Compressed Attention — marketed collectively as "DeepSeek Sparse Attention." The model processes long contexts without having to compare every token to every other token at full precision.
- Manifold-Constrained Hyper-Connections (mHC). A modification to the standard residual connections in the transformer stack. The practical effect is more stable signal propagation across layers, which matters a lot at this scale.
- Muon optimizer. A newer optimizer for training stability and faster convergence, first popularized by Moonshot AI's Kimi series and now adopted by DeepSeek. This kind of cross-lab diffusion of technique is part of why Chinese open-source labs have been iterating quickly.
- FP4 + FP8 mixed precision. Quantization-aware training means the model was trained with the expectation that its weights would be stored at lower precision. MoE expert parameters sit in FP4; the rest in FP8. That is what keeps the on-disk footprint in the 160 GB to 865 GB range rather than multiple terabytes.
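The footprint claim in the last bullet can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes FP4 costs 0.5 bytes per parameter, FP8 costs 1 byte, sizes are in decimal gigabytes, and quantization scales and file metadata are ignored. The FP4 fractions it derives are estimates implied by those assumptions, not figures published by DeepSeek.

```python
FP4_BYTES = 0.5   # 4 bits per parameter
FP8_BYTES = 1.0   # 8 bits per parameter
FP16_BYTES = 2.0  # 16-bit baseline, for comparison only


def avg_bytes_per_param(on_disk_gb: float, params_billions: float) -> float:
    """Average storage per parameter (GB per billion params equals bytes per param)."""
    return on_disk_gb / params_billions


def implied_fp4_fraction(on_disk_gb: float, params_billions: float) -> float:
    """Solve f*0.5 + (1 - f)*1.0 = average bytes per parameter for f."""
    avg = avg_bytes_per_param(on_disk_gb, params_billions)
    return (FP8_BYTES - avg) / (FP8_BYTES - FP4_BYTES)


def fp16_size_gb(params_billions: float) -> float:
    """Size the same parameter count would occupy at 16 bits per parameter."""
    return params_billions * FP16_BYTES


# On-disk sizes and parameter counts from the specification table above.
for name, gb, params in [("V4-Pro", 865, 1600), ("V4-Flash", 160, 284)]:
    print(name,
          round(avg_bytes_per_param(gb, params), 3),
          round(implied_fp4_fraction(gb, params), 3),
          fp16_size_gb(params))
```

Under these assumptions both models average a little over half a byte per parameter, which is what one would expect if the large majority of parameters are FP4 experts, and a 16-bit copy of V4-Pro would come to roughly 3.2 TB.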
There is also a supply-chain detail worth noting. According to The Register's reporting on the release, DeepSeek validated V4's fine-grained expert-parallel scheme on both Nvidia GPUs and Huawei Ascend NPUs, and partnered closely with Huawei during training. This mirrors what we covered when Z.ai released GLM-5.1, which was trained entirely on Huawei Ascend 910B chips with zero Nvidia hardware. The Chinese open-source frontier is increasingly being developed on non-Nvidia compute, which has long-term implications for export controls and hardware supply chains that are worth watching.
Benchmarks — Where V4-Pro Wins, Where It Trails
DeepSeek's technical report is unusually honest about where the model comes up short. In their own words, V4 "falls marginally short of GPT-5.4 and Gemini 3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months." That self-assessment is worth anchoring before looking at any specific benchmark, because the "V4 beats everything" framing circulating on social media is not what DeepSeek itself claims.
With that calibration, here is the picture DeepSeek's own tables show when V4-Pro-Max (the highest reasoning effort mode) is compared against Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini 3.1 Pro High:
| Benchmark (category) | V4-Pro-Max | Opus 4.6 Max | GPT-5.4 xHigh | Gemini 3.1 Pro High |
|---|---|---|---|---|
| SimpleQA-Verified (knowledge) | 57.9 | 46.2 | 45.3 | **75.6** |
| GPQA Diamond (science) | 90.1 | 91.3 | 93.0 | **94.3** |
| HLE (hard reasoning) | 37.7 | 40.0 | 39.8 | **44.4** |
| LiveCodeBench (coding) | **93.5** | 88.8 | — | 91.7 |
| Codeforces (rating) | **3206** | — | 3168 | 3052 |
| Apex Shortlist (math/STEM) | **90.2** | 85.9 | 78.1 | 89.1 |
| SWE Verified (real-world code) | 80.6 | **80.8** | — | 80.6 |
| Terminal Bench 2.0 (agentic) | 67.9 | 65.4 | **75.1** | 68.5 |
Source: DeepSeek V4 technical report. Bold indicates the leading score on that benchmark. Dashes indicate the lab did not publish a comparable number.
The pattern is coherent. V4-Pro is genuinely best-in-class on competitive programming, pure code generation, and structured math problems. On real-world software engineering (SWE Verified) it is effectively tied with Claude Opus 4.6 (80.6 versus 80.8), and on agentic terminal execution it nudges ahead of Opus 4.6 (67.9 versus 65.4) while trailing GPT-5.4 (75.1) and sitting just behind Gemini 3.1 Pro (68.5). Where it lags is factual knowledge retrieval, where Gemini 3.1 Pro leads by a wide margin (75.6 versus 57.9), and expert-level cross-domain reasoning (GPQA Diamond, HLE), where it trails all three competitors. That is the "3 to 6 month" gap DeepSeek is willing to name in print.
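The pattern can be checked mechanically against the table. The sketch below transcribes the scores above (dashes become `None`) and reports which model leads each benchmark. The variable and function names are illustrative only.

```python
MODELS = ["V4-Pro-Max", "Opus 4.6 Max", "GPT-5.4 xHigh", "Gemini 3.1 Pro High"]

# Scores transcribed from the table above; None marks an unpublished number.
SCORES = {
    "SimpleQA-Verified": [57.9, 46.2, 45.3, 75.6],
    "GPQA Diamond": [90.1, 91.3, 93.0, 94.3],
    "HLE": [37.7, 40.0, 39.8, 44.4],
    "LiveCodeBench": [93.5, 88.8, None, 91.7],
    "Codeforces": [3206, None, 3168, 3052],
    "Apex Shortlist": [90.2, 85.9, 78.1, 89.1],
    "SWE Verified": [80.6, 80.8, None, 80.6],
    "Terminal Bench 2.0": [67.9, 65.4, 75.1, 68.5],
}


def leaders(benchmark: str) -> list:
    """Return the model(s) holding the highest published score on a benchmark."""
    row = SCORES[benchmark]
    best = max(score for score in row if score is not None)
    return [model for model, score in zip(MODELS, row) if score == best]


# Benchmarks where V4-Pro-Max is the sole leader.
v4_wins = [name for name in SCORES if leaders(name) == ["V4-Pro-Max"]]
print(v4_wins)  # ['LiveCodeBench', 'Codeforces', 'Apex Shortlist']
```

V4-Pro-Max leads outright on three of the eight rows, all in coding and math, which matches the reading above.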
Independent evaluators broadly support these findings. Artificial Analysis ranked V4-Pro first among all open-weight models on GDPval-AA, their benchmark for economically valuable knowledge work. Vals AI called V4 "overwhelmingly" the number-one open-weight model on its Vibe Code Benchmark, with a roughly tenfold jump in score compared to V3.2. Simon Willison, who runs one of the more skeptical public LLM evaluation blogs, described V4-Pro as "almost on the frontier, a fraction of the price."
The honest summary: if your workload is code generation, competitive programming, or math-heavy reasoning, V4-Pro is competitive with or ahead of Claude Opus 4.6 and GPT-5.4 at a fraction of the cost. If your workload depends heavily on factual recall or bleeding-edge expert reasoning, Gemini 3.1 Pro and Claude Opus 4.7 still hold a measurable edge.
Pricing — The Real Headline
The benchmark story is interesting. The pricing story is what reshapes the market. Here is how V4 prices against the current frontier-tier APIs, per million tokens:
| Model | Input (cache hit) | Input (cache miss) | Output |
|---|---|---|---|
| DeepSeek V4-Flash | $0.028 | $0.14 | $0.28 |
| DeepSeek V4-Pro | $0.145 | $1.74 | $3.48 |
| Gemini 3.1 Pro | — | $2.00 | $12.00 |
| Claude Opus 4.7 | — | $5.00 | $25.00 |
| GPT-5.5 | — | $5.00 | $30.00 |
Prices in USD per million tokens. Cache-hit pricing applies to repeated system prompts that hit DeepSeek's prompt cache. Competitor cache-hit pricing omitted for direct comparability.
The clean arithmetic on cache-miss pricing, which is the most conservative comparison:
- On output tokens, V4-Pro at $3.48 is roughly one-seventh the price of Claude Opus 4.7 at $25 and one-ninth the price of GPT-5.5 at $30.
- On input tokens (cache miss), V4-Pro at $1.74 is roughly one-third of Opus 4.7 and GPT-5.5 at $5.
- On output tokens, V4-Flash at $0.28 is roughly 1/90th of Opus 4.7 and 1/107th of GPT-5.5. This is the number that is genuinely difficult to internalize the first time you see it.
Cache-hit pricing widens the gap further. A repetitive workflow that reuses the same long system prompt can see V4-Pro input costs drop to $0.145 per million tokens, which is 34 times cheaper than paying full price on Opus 4.7 input. These are not marketing round-downs; VentureBeat's pricing analysis confirms the same arithmetic against current Anthropic and OpenAI list prices.
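The ratios quoted above follow directly from the price table, and the same numbers can be turned into a rough per-job estimate. The sketch below transcribes the table as listed. The `job_cost` helper and its `cache_hit_rate` parameter are illustrative assumptions about how a workload mixes cached and uncached input, not a billing formula published by any provider.

```python
# USD per million tokens, transcribed from the table above:
# (input cache hit, input cache miss, output). None means not listed here.
PRICES = {
    "DeepSeek V4-Flash": (0.028, 0.14, 0.28),
    "DeepSeek V4-Pro": (0.145, 1.74, 3.48),
    "Gemini 3.1 Pro": (None, 2.00, 12.00),
    "Claude Opus 4.7": (None, 5.00, 25.00),
    "GPT-5.5": (None, 5.00, 30.00),
}


def output_price_ratio(cheaper: str, pricier: str) -> float:
    """How many times more the pricier model charges per output token."""
    return PRICES[pricier][2] / PRICES[cheaper][2]


def job_cost(model: str, input_tokens: int, output_tokens: int,
             cache_hit_rate: float = 0.0) -> float:
    """Estimated USD cost of one job; cache_hit_rate is the cached share of input."""
    hit, miss, out = PRICES[model]
    if hit is None:
        hit = miss  # no cache-hit price listed above; bill all input at the miss rate
    input_rate = cache_hit_rate * hit + (1.0 - cache_hit_rate) * miss
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * out


print(round(output_price_ratio("DeepSeek V4-Pro", "Claude Opus 4.7"), 1))
print(round(job_cost("DeepSeek V4-Pro", 10_000_000, 2_000_000), 2))
print(round(job_cost("Claude Opus 4.7", 10_000_000, 2_000_000), 2))
```

For a job with 10 million input tokens and 2 million output tokens at cache-miss rates, the table prices work out to about $24 on V4-Pro against $100 on Opus 4.7.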
There is one caveat worth naming. DeepSeek itself acknowledges the Pro service is throughput-limited at launch because of "high-end compute constraints," and the company has signaled that prices may drop further once 950 new Huawei Ascend supernodes come online later in 2026. Pricing at this level is not a subsidy — the architectural efficiency gains are real — but it is also not guaranteed to be stable.
Can You Actually Run This at Home?
This is the question we expect most home-lab readers to arrive with, and the honest answer is layered.
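V4-Pro is not a home-lab model. At roughly 865 GB on disk in mixed FP4/FP8 precision, the minimum realistic deployment is eight H100 80 GB GPUs with NVLink, and a DGX H100 node or 8x H200 configuration is recommended for production use. Note that eight 80 GB cards provide 640 GB of GPU memory in total, less than the 865 GB checkpoint, so the H100 floor implies further quantization or offloading, whereas 8x H200 (1,128 GB) holds the full checkpoint with room left for the KV cache. That is data-center territory at a cost well north of $200,000 just for the hardware, before power, cooling, and networking. For nearly every individual reader, V4-Pro is an API model, not a self-hosted one.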
V4-Flash is borderline feasible for prosumer hardware. At around 160 GB on disk, it is within reach of a single Nvidia H200 (141 GB HBM3e) or a pair of A100 80 GB GPUs (160 GB combined) once the weights are quantized somewhat below the release precision to leave room for the KV cache, or, with aggressive INT4 quantization and quality trade-offs, four RTX 4090s in a single chassis. The most interesting consumer path is Apple Silicon: a 128 GB unified-memory Mac Studio (M3 Ultra configuration) or a 128 GB M5 MacBook Pro should run a lightly quantized Flash build via llama.cpp or MLX. Simon Willison reported on release day that he expected Unsloth's quantized versions to appear within hours and that he planned to test Flash on his 128 GB M5 MacBook Pro. For context on why unified memory matters so much here, our guide to the best hardware for running local AI models walks through the VRAM-is-everything principle in detail.
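A rough way to compare these options is to divide the checkpoint size by the memory each configuration offers. The sketch below uses the on-disk sizes and hardware named in this section and counts weights only. The `headroom` parameter is a hypothetical knob for memory reserved for the KV cache, activations, and the operating system, and the 24 GB figure is the RTX 4090's VRAM.

```python
# On-disk checkpoint sizes in GB, from the specification table.
CHECKPOINT_GB = {"V4-Pro": 865, "V4-Flash": 160}

# Aggregate memory in GB for the configurations discussed in this section.
MEMORY_GB = {
    "8x H100 80 GB": 8 * 80,
    "8x H200 141 GB": 8 * 141,
    "1x H200 141 GB": 141,
    "2x A100 80 GB": 2 * 80,
    "4x RTX 4090 24 GB": 4 * 24,
    "128 GB Apple Silicon": 128,
}


def shrink_factor(model: str, hardware: str, headroom: float = 0.0) -> float:
    """Factor by which the released weights must shrink to fit in usable memory.

    headroom is the fraction of memory held back for KV cache and runtime use.
    A result of 1.0 or less means the weights fit as released.
    """
    usable = MEMORY_GB[hardware] * (1.0 - headroom)
    return CHECKPOINT_GB[model] / usable


for hardware in ["1x H200 141 GB", "2x A100 80 GB",
                 "4x RTX 4090 24 GB", "128 GB Apple Silicon"]:
    print("V4-Flash on", hardware, round(shrink_factor("V4-Flash", hardware), 2))
```

Any configuration with a factor above 1.0 needs a build smaller than the released checkpoint (further quantization) or some form of offloading, and real deployments need headroom on top of the weights.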
For most readers, V4 is an open-weight enterprise model, not a local-runnable one. That distinction matters. Open weights are the sovereignty escape valve (any well-funded organization can self-host without asking permission), but the gap between "open weights" and "runs on my desk" has widened with this release. If your goal is a model you can actually run at home, our April 2026 roundup of the best open-source LLMs and the hardware you need remains the right starting point. Qwen3.6, Gemma 4, and the smaller Llama variants are where practical home-lab inference actually lives.
If you are storing multiple model weights locally for experimentation, fast NVMe storage becomes genuinely important. A 160 GB Flash checkpoint plus the smaller models from the comparison article will fill 2 TB quickly. The Samsung 990 EVO Plus 2 TB on Amazon is a reliable Gen 4 option for single-model workflows, and the Samsung 990 EVO Plus 4 TB is the practical choice if you plan to keep several quantized checkpoints on disk at once. (Affiliate disclosure: ModemGuides earns from qualifying Amazon purchases.)
The Data Sovereignty Question
Using DeepSeek's hosted API means your prompts and outputs traverse Chinese infrastructure. For most personal use, this is not meaningfully different from sending prompts to OpenAI or Anthropic servers in the United States; you are trusting a third-party operator with your tokens either way. For regulated industries, IP-sensitive enterprise workloads, or anyone subject to data-residency requirements, it is a more serious consideration.
One factual note for context: DeepSeek was named in Anthropic's February 2026 distillation report as one of three Chinese labs accused of using proxy accounts to extract Claude's capabilities at scale, a dynamic we covered in our Claude Code leak deep-dive. That history does not make V4 less technically impressive, but it is legitimate context when deciding where to send your prompts.
There are three reasonable paths if the data-residency question matters to you:
- Self-host V4-Flash on a 128 GB Mac or multi-GPU rig. Slower and more expensive per token than the hosted API, but your data never leaves your network. This is the full sovereignty option.
- Use a non-DeepSeek API provider. OpenRouter and several other API aggregators host DeepSeek V4 on U.S. or European infrastructure. Same weights, different operator. Read their data-handling policies before you trust them.
- Stick with models you already trust for sensitive work and route only non-sensitive traffic to V4 for cost savings. A well-designed routing layer can capture most of the pricing upside without putting regulated data on Chinese servers.
The MIT license is what makes the first option possible. That is the practical value of open weights: you do not have to take the operator's word for where your data goes, because you can verify it by running the model yourself.
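The third path, a routing layer, can be very small. The sketch below is one possible shape, assuming the caller has already classified each request as sensitive or not. The `Request` type, the backend labels, and `choose_backend` are all hypothetical names for illustration, and the classification step itself is out of scope.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical backend labels; substitute whatever providers you have vetted.
TRUSTED_BACKEND = "vetted-provider"
LOW_COST_BACKEND = "deepseek-v4-flash"


@dataclass
class Request:
    prompt: str
    # Set by your own data-classification step; None means "not classified".
    sensitive: Optional[bool] = None


def choose_backend(req: Request) -> str:
    """Send only requests explicitly marked non-sensitive to the low-cost backend."""
    if req.sensitive is False:
        return LOW_COST_BACKEND
    # Sensitive or unclassified traffic fails closed to the trusted provider.
    return TRUSTED_BACKEND
```

Treating unclassified requests as sensitive trades some cost savings for safety: a missed classification costs money rather than leaking regulated data.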
What This Means for the Open vs. Closed AI Race
Three frontier-adjacent open-weight models have shipped in the last six weeks: GLM-5.1 from Z.ai, Kimi K2.6 from Moonshot AI, and now DeepSeek V4. All three are within striking distance of Claude Opus 4.7 and GPT-5.5 on most coding and agentic benchmarks. All three are MIT or near-MIT licensed. All three are materially cheaper than the closed-source alternatives.
The pricing pressure on closed providers is now structural rather than promotional. It is driven by genuine architectural efficiency — the CSA/HCA attention in V4, the Engram memory approach in GLM-5.1, the MoE sparsity patterns across all three — not by subsidy. Closed providers will need to either match on cost, which compresses their margins, or differentiate harder on quality, safety, and integration depth, which is a different competitive posture than they have been running.
For ModemGuides readers, the practical upshot is that both directions of the digital sovereignty toolkit got better in the same week. Hosted AI is dramatically cheaper. Self-hosted AI is more credibly competitive with the proprietary frontier than it has ever been. The question has shifted from "can open models keep up?" to "which open model fits the problem I actually have?"
This is the same principle that underwrites every recommendation on this site. When you control the infrastructure, you control the experience. Owning your modem beats renting your modem. Running AI models on hardware you own beats routing every query through a cloud API you rent. The V4 release makes the second half of that sentence easier to act on, not harder.
How to Try V4 Right Now
The fastest ways to test V4 against your own workload, in order of friction:
- Web interface. Go to chat.deepseek.com. "Expert Mode" routes to V4-Pro; "Instant Mode" routes to V4-Flash. Free with rate limits.
- Direct API. V4 is a drop-in replacement for existing DeepSeek integrations. The API supports both OpenAI ChatCompletions and Anthropic-compatible endpoints. For most users, the only change is the model string:
  ```python
  # Before
  client.chat.completions.create(model="deepseek-chat", ...)
  # After
  client.chat.completions.create(model="deepseek-v4-pro", ...)
  # Or, for the cheaper/faster variant:
  client.chat.completions.create(model="deepseek-v4-flash", ...)
  ```
- Coding agents. DeepSeek called out integration with Claude Code, OpenClaw, and OpenCode in the launch materials. For sovereignty-conscious agent setups — where you want local execution with a choice of model backend — our roundup of the best OpenClaw alternatives for 2026 covers the practical options.
- Local inference (Flash only, with caveats). Weights are on Hugging Face at deepseek-ai/DeepSeek-V4-Flash. vLLM and SGLang are the recommended inference frameworks. Expect the community to publish optimized GGUF and AWQ quantizations via Unsloth and similar projects within days of launch.
- Third-party API hosts. OpenRouter and other aggregators will surface V4 on non-DeepSeek infrastructure if data residency matters for your workload.
Frequently Asked Questions
Is DeepSeek V4 really open source?
Yes. Both V4-Pro and V4-Flash are released under the MIT License on Hugging Face, which permits commercial use, modification, and redistribution with minimal restrictions. This is a permissive open-source license, not a restricted or modified one. You can download the weights, fine-tune them for your own purposes, and run them on your own infrastructure without asking DeepSeek for permission.
What's the difference between V4-Pro and V4-Flash?
V4-Pro is the larger, more capable model at 1.6 trillion total parameters with 49 billion active per token. V4-Flash is smaller and cheaper at 284 billion total parameters with 13 billion active. Both share the same 1 million token context window and the same three reasoning effort modes. For most workloads, V4-Flash delivers 85 to 95 percent of Pro's quality at roughly one-twelfth the output-token cost. Reserve V4-Pro for tasks where you specifically need its stronger agentic coding, long-context reasoning, or knowledge capabilities.
Can I run DeepSeek V4 on my home computer?
V4-Pro, no — it requires at least 8 H100 80 GB GPUs, which is data-center hardware. V4-Flash is borderline feasible on a 128 GB unified-memory Apple Silicon machine (Mac Studio M3 Ultra or M5 MacBook Pro 128 GB) with community quantization, or on a multi-GPU rig with INT4 quantization. For models you can actually run on typical home hardware, Qwen3.6, Gemma 4, and smaller Llama variants remain the practical choices — our local AI hardware guide covers the full picture.
How does V4 compare to Claude Opus 4.7 and GPT-5.5?
DeepSeek's published comparisons are against Claude Opus 4.6 and GPT-5.4, not Opus 4.7 and GPT-5.5, so a direct head-to-head is not yet available. Against those models, V4-Pro posts the top score on competitive programming (Codeforces), pure code generation (LiveCodeBench), and structured math problems (Apex Shortlist). On real-world software engineering (SWE Verified), V4-Pro is effectively tied with Claude Opus 4.6 (80.6 versus 80.8). On factual knowledge retrieval (SimpleQA-Verified) and expert-level reasoning (HLE, GPQA Diamond), Gemini 3.1 Pro and Claude Opus 4.7 hold a measurable edge. DeepSeek's own paper describes V4 as trailing the frontier by "approximately 3 to 6 months" on the hardest reasoning tasks, an unusually honest self-assessment.
Is it safe to send sensitive data to DeepSeek's API?
That depends on what you mean by "sensitive." For personal use and non-regulated workloads, the calculus is similar to sending prompts to any foreign-hosted AI API. For regulated industries, IP-sensitive enterprise work, or anyone with data-residency obligations, the Chinese hosting location and DeepSeek's history in distillation accusations are both legitimate concerns. The three reasonable mitigations are self-hosting V4-Flash locally, using a non-DeepSeek API provider like OpenRouter, or routing only non-sensitive traffic to V4 while keeping regulated data on providers you have already vetted.
What hardware do I need to self-host V4-Flash?
The realistic paths are a single Nvidia H200 with 141 GB HBM3e, a pair of A100 80 GB GPUs, four RTX 4090s with INT4 quantization and measurable quality trade-offs, or a 128 GB unified-memory Apple Silicon machine (Mac Studio M3 Ultra or M5 MacBook Pro 128 GB) running a community-quantized GGUF or MLX build. vLLM and SGLang are the recommended inference frameworks for GPU deployments; llama.cpp and MLX are the typical choices on Apple Silicon.
When will the existing deepseek-chat and deepseek-reasoner models stop working?
DeepSeek has scheduled the retirement of both legacy model strings for July 24, 2026 at 15:59 UTC. Until then, requests that use those names are being routed to V4-Flash in Non-Thinking and Thinking modes respectively. Any production code still pointing at deepseek-chat or deepseek-reasoner should migrate to deepseek-v4-pro or deepseek-v4-flash before that cutoff.

