Last updated: April 24, 2026
Key Takeaways
- Five major open-source LLM events landed in the three weeks ending April 23, 2026: Gemma 4, GLM-5.1 open weights, MiniMax M2.7 weights, Qwen3.6-35B-A3B, and now DeepSeek V4 (Pro and Flash). It is the strongest open-model class we have seen, and DeepSeek V4-Pro is the new largest open-weight model ever released.
- The class still splits cleanly into two tiers, with one new wrinkle. Gemma 4 and Qwen3.6-35B-A3B genuinely run on consumer hardware. GLM-5.1, Llama 4 Maverick, and DeepSeek V4-Pro are "open weights" in name but need data-center GPUs to run at full quality. MiniMax M2.7 and DeepSeek V4-Flash sit on a borderline that a 128 GB Mac Studio can just about reach.
- For most readers who want to self-host today, the default pick is still Gemma 4 26B MoE (Apache 2.0, runs in 16 GB of RAM). With 32 GB of unified memory or a 24 GB VRAM GPU, Qwen3.6-35B-A3B is the more capable option. If you have a 128 GB Mac Studio and you want frontier-class open weights, DeepSeek V4-Flash is the new ceiling.
Why April 2026 matters for open-source AI
On April 2, Google DeepMind released Gemma 4 under a fully permissive Apache 2.0 license. On April 7, Z.ai (formerly Zhipu AI) published the GLM-5.1 weights. On April 11, MiniMax put the M2.7 weights on Hugging Face. On April 16, Alibaba's Qwen team shipped Qwen3.6-35B-A3B. On April 23, DeepSeek released V4-Pro and V4-Flash, with V4-Pro overtaking Moonshot AI's Kimi K2.6 (1.1T) and Z.ai's GLM-5.1 (744B) to become the largest open-weight model ever released. Meta's Llama 4 Scout and Maverick (April 2025) round out the field as the long-context Llama flagships.
Five significant release events in three weeks is not normal, even by 2026 standards. What makes the moment worth writing about is not the count. It is that the benchmark gap between the best open models and the best proprietary models has now closed to single digits on the evaluations enterprises actually care about. Simon Willison reported on April 16 that a 20.9 GB quantized Qwen3.6-35B-A3B running on his MacBook Pro produced a better SVG illustration than Anthropic's brand-new Claude Opus 4.7 on his long-running pelican benchmark. A week later, on April 24, he wrote that DeepSeek V4-Pro is "almost on the frontier, a fraction of the price."
That is a single test and a single price comparison. The quality gap on the hardest reasoning tasks still favors proprietary flagships. But the trend line is unambiguous. Open weights that run locally are now genuinely competitive with proprietary flagships on many real tasks, and open weights that run via API are an order of magnitude cheaper. For readers who care about data sovereignty, privacy, and escaping recurring API costs, the question has shifted. It is no longer "is an open model good enough?" It is "which open model fits the hardware I already have, or can reasonably afford?"
The April 2026 open-source landscape at a glance
Here are the eleven models we consider serious contenders this month, grouped by hardware reality so you can skim straight to the tier you can afford to run.
| Model | Released | Params (Total / Active) | Context | License | Runs on consumer hardware? |
|---|---|---|---|---|---|
| Tier 1 — runs on consumer hardware | |||||
| Qwen3.6-35B-A3B | Apr 16, 2026 | 35B / 3B (MoE) | 262K, ext. ~1M | Apache 2.0 | Yes — 20.9 GB at Q4; 32 GB Mac or 24 GB GPU. |
| Gemma 4 31B Dense | Apr 2, 2026 | 31B / 31B | 256K | Apache 2.0 | Yes — ~20 GB at Q4; 24 GB GPU or 32 GB Mac. |
| Gemma 4 26B MoE | Apr 2, 2026 | 26B / 3.8B (MoE) | 256K | Apache 2.0 | Yes — ~16 GB RAM at Q4. Best value. |
| Gemma 4 E4B | Apr 2, 2026 | ~4.5B / ~4B | 256K | Apache 2.0 | Yes — 8–12 GB RAM. Laptop-class. Audio capable. |
| Gemma 4 E2B | Apr 2, 2026 | ~2.3B / ~2B | 256K | Apache 2.0 | Yes — Raspberry Pi 5, phones. Audio capable. |
| Borderline — possible on a 128 GB Mac Studio with quantization | |||||
| DeepSeek V4-Flash | Apr 23, 2026 | 284B / 13B (MoE) | 1M | MIT | Borderline — ~160 GB on disk; quantized fits on 128 GB Mac. |
| MiniMax M2.7 | Apr 11, 2026 (weights) | 230B / 10B (MoE) | 200K | Modified MIT (non-commercial) | Borderline — 108 GB Q4 on 128 GB Mac Studio at ~15 t/s. |
| Tier 2 — needs data-center hardware | |||||
| DeepSeek V4-Pro | Apr 23, 2026 | 1.6T / 49B (MoE) | 1M | MIT | No — ~865 GB on disk; 8x H100 minimum. |
| GLM-5.1 | Apr 7, 2026 | 744B / 40B (MoE) | 200K | MIT | No — ~1.49 TB BF16; 8x H200-class GPUs. |
| Llama 4 Scout | Apr 5, 2025 | 109B / 17B (MoE, 16 experts) | 10M | Llama 4 Community (EU restriction, MAU clause) | Edge case — 54 GB int4 on a single H100. |
| Llama 4 Maverick | Apr 5, 2025 | 400B / 17B (MoE, 128 experts) | 1M | Llama 4 Community (EU restriction, MAU clause) | No — multi-H100 host required. |
A quick note on what we cut. We did not include Mistral Small 4 (a strong March 2026 release but not the news of the moment), NVIDIA Nemotron 3 Super (useful but narrowly scoped), or OpenAI's gpt-oss family (now clearly trailing the April class). DeepSeek V3.2 has been superseded by V4 and is being retired on July 24, 2026; Moonshot AI's Kimi K2.6 (1.1T) sits in the same data-center-only bucket as V4-Pro and GLM-5.1.
The two tiers, and why the distinction matters
When a model card says "open weights," people hear "free." Sometimes that is true. Often it is not. The April 2026 class breaks cleanly into two groups, plus a borderline category that has become more important since the V4 release.
Tier 1: Models that actually run on consumer hardware
Gemma 4 (all four sizes) and Qwen3.6-35B-A3B are genuinely runnable on equipment a person or small team can afford. An E2B variant fits on a Raspberry Pi 5. The 26B MoE Gemma 4 runs in roughly 16 GB of RAM at 4-bit quantization. Qwen3.6-35B-A3B, in the Unsloth Dynamic Q4 build, weighs in at 20.9 GB and runs on a MacBook Pro with 32 GB of unified memory.
This tier is where digital sovereignty actually lives for most readers. Your prompts stay on your machine. Your documents never leave your network. There is no API meter running. And when your ISP has an outage, your AI keeps working.
The borderline tier: open weights you might cajole onto a 128 GB Mac Studio
Two models sit in a category the original April class did not have a clean name for. DeepSeek V4-Flash (~160 GB on disk in mixed FP4/FP8 precision) and MiniMax M2.7 (108 GB at Unsloth Q4) are too big for a typical home machine but can be coaxed onto a 128 GB unified-memory Mac Studio with aggressive quantization. Throughput is modest: roughly 15 tokens per second for MiniMax M2.7, with similar territory expected for V4-Flash once Unsloth's quantized GGUFs land. That is slower than running a Tier 1 model on the same hardware, but it puts frontier-class open weights inside a $4,000 box rather than a $40,000 GPU rack.
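If you are weighing a borderline-tier purchase, measure throughput yourself rather than trusting headline numbers. Ollama's local /api/generate endpoint reports token counts and timing in its final response object, so a few lines of Python produce a real tokens-per-second figure. A minimal sketch; the model tag is a placeholder for whichever quantized build you actually pulled:

```python
import json
import urllib.request

MODEL = "minimax-m2.7:q4"  # placeholder tag; substitute your local model name

# With "stream": False, Ollama returns one JSON object that includes
# eval_count (tokens generated) and eval_duration (nanoseconds).
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": MODEL,
        "prompt": "Summarize the tradeoffs of sparse MoE models in three sentences.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

tokens = body["eval_count"]
seconds = body["eval_duration"] / 1e9  # Ollama reports nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```

Run it two or three times and discard the first result, which can be skewed by a cold cache (model load time is reported separately, but the first generation pass still tends to run slow).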
Tier 2: Open weights that require data-center hardware
GLM-5.1, Llama 4 Maverick, and DeepSeek V4-Pro all have publicly downloadable weights, and all of them are effectively unrunnable at home. GLM-5.1's full BF16 weights weigh approximately 1.49 terabytes and require 8-way tensor parallelism across enterprise GPUs, meaning 8x NVIDIA H200s or equivalent. Llama 4 Maverick needs a multi-node H100 DGX setup for full performance. DeepSeek V4-Pro is the heaviest of the three at ~865 GB, with a realistic minimum of 8x H100 80 GB GPUs and a recommended deployment of 8x H200.
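To make "8-way tensor parallelism" concrete: in a serving stack such as vLLM, it is a single argument that shards the weights across every GPU in the host, and all eight must be present before the model will load. A sketch of the shape of the thing; the repo id is a stand-in, since we have not pinned down the exact name Z.ai uses on Hugging Face:

```python
from vllm import LLM, SamplingParams  # pip install vllm; requires the GPUs to exist

# Shards the 744B GLM-5.1 weights across 8 GPUs in one host. The repo id
# below is a placeholder, not a verified Hugging Face path.
llm = LLM(
    model="zai-org/GLM-5.1",   # placeholder repo id
    tensor_parallel_size=8,    # one weight shard per GPU; this is the "8-way" part
    dtype="bfloat16",
)

outputs = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

The point of showing this is what it implies: the software side is two arguments, and the entire barrier is the rack of enterprise GPUs those two arguments assume.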
For a deeper dive on the infrastructure side of GLM-5.1 specifically, see our full GLM-5.1 coverage, which walks through why most readers will end up using it through an API or subscription rather than self-hosting. For DeepSeek V4-Pro, our DeepSeek V4 release deep-dive covers benchmarks, pricing, and the data sovereignty question of using DeepSeek's hosted API.
Model-by-model: the consumer-runnable tier
Qwen3.6-35B-A3B: the local-inference champion
Alibaba's Qwen team released Qwen3.6-35B-A3B on April 16, 2026, under the Apache 2.0 license. The architecture is sparse Mixture-of-Experts: 35 billion total parameters but only 3 billion active per inference token. It supports a native 262K context window extensible to roughly 1 million tokens, and it is natively multimodal. Thinking and non-thinking modes are both supported.
What makes it notable is not the parameter count but the benchmark efficiency. On Terminal-Bench 2.0 (agentic terminal coding), Qwen3.6-35B-A3B scored 51.5 compared to Gemma 4 31B's 42.9. On SWE-Bench Verified it reached 73.4 against the Gemma 4 31B Dense variant's 75.0. These are competitive numbers against dense models with roughly ten times the active parameter count.
The hardware story matters here. Simon Willison ran the Unsloth-quantized Q4 build (20.9 GB on disk) on a MacBook Pro M5 via LM Studio the morning of release, and reported generation quality that beat Anthropic's brand-new proprietary flagship, Claude Opus 4.7, on a creative SVG benchmark he has maintained for over a year. That is a single test, and the quality gap on harder reasoning tasks likely still favors Opus. But for a model that fits in a 24 GB VRAM GPU or a 32 GB Mac and streams tokens at usable speed, the ceiling is remarkable.
Where to get it: the official Qwen repository on Hugging Face, Unsloth's quantized GGUF builds, or through Ollama once the community tags are published.
Gemma 4: the most versatile family on the list
Google DeepMind released the Gemma 4 family on April 2, 2026, in four sizes: E2B (edge, ~2B effective), E4B (mobile-class, ~4B effective), 26B Mixture-of-Experts (3.8B active), and 31B Dense. Every variant is released under the Apache 2.0 license, which is the first time the Gemma family has shipped without a custom license. All four are natively multimodal, and the E2B and E4B edge models additionally handle audio input through an on-device encoder.
The benchmark jumps from Gemma 3 are real and not marginal. The 31B scored 89.2% on AIME 2026 (mathematics), up from 20.8% on Gemma 3 27B. LiveCodeBench v6 went from 29.1% to 80.0%. On the LMArena text leaderboard, Gemma 4 31B currently ranks #3 among open models with an Elo score around 1452, and the 26B MoE sits at #6 at 1441 while activating only 3.8 billion parameters per token.
For most readers, the 26B MoE is the sleeper pick. You get quality within a few points of the 31B Dense model at roughly one-eighth the inference compute. The 4-bit quantized version runs in about 16 GB of RAM, which means a reasonably specced mini PC or a MacBook with 24 GB of unified memory handles it comfortably. If you have a single 24 GB NVIDIA GPU such as an RTX 3090 or RTX 4090, you can run the 31B Dense variant at Q4 with headroom.
Install path is as simple as it gets. Gemma 4 ships with day-zero Ollama, llama.cpp, MLX, and LM Studio support. The command `ollama run gemma4:26b` pulls and runs the MoE variant. If you are new to local inference, this is the lowest-friction entry point in the current class.
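If you would rather drive the model from a script than from a terminal session, the official ollama Python package wraps the same local server. A minimal sketch, assuming you have already pulled the 26B MoE tag mentioned above:

```python
import ollama  # pip install ollama; talks to the local Ollama server

# Assumes `ollama pull gemma4:26b` has already downloaded the MoE variant.
response = ollama.chat(
    model="gemma4:26b",
    messages=[{"role": "user", "content": "Explain MoE expert routing in two sentences."}],
)
print(response["message"]["content"])
```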
Llama 4 Scout: the long-context specialist
Meta's Llama 4 Scout is not a 2026 release. It shipped on April 5, 2025, alongside its larger sibling Maverick, and remains the current open-weight Llama flagship as of this writing. What keeps it relevant is one specification no other open model matches: a 10 million token context window. Scout has 109 billion total parameters with 17 billion active across 16 experts. With int4 quantization, it fits on a single NVIDIA H100 (roughly 54 GB), which places it at the far edge of what a well-funded enthusiast or small team can reasonably run.
The use cases where it wins are entire-codebase retrieval, book-length document synthesis, and long-running agentic workflows that need to maintain state across very long sessions. For general chat, reasoning, or coding, the Apache 2.0 Gemma 4 and Qwen3.6 options are a better fit because of their more permissive licensing. Llama 4 ships under Meta's Community License, which restricts use by entities over 700 million monthly active users and blocks EU-based companies from accepting the license terms outright.
The borderline tier: 128 GB Mac Studio territory
DeepSeek V4-Flash: the new ceiling for borderline-feasible self-hosting
DeepSeek released V4-Flash alongside the larger V4-Pro on April 23, 2026, both under the MIT License. Flash is a 284 billion parameter Mixture-of-Experts model with 13 billion active per token. It supports a 1 million token context window by default and offers three reasoning effort modes (Non-Thinking, Thinking, Max). The architecture uses DeepSeek's hybrid attention design (Compressed Sparse Attention plus Heavily Compressed Attention) that drops single-token FLOPs to roughly 10% of V3.2 at 1M context, and cuts KV cache memory even more dramatically, to roughly 7%.
For self-hosting, the practical math is that Flash weighs ~160 GB in DeepSeek's published FP4+FP8 mixed precision. A single NVIDIA H200 (141 GB HBM3e) fits it comfortably. Two A100 80 GB cards work. Four RTX 4090s with INT4 quantization can host it at measurable quality cost. The most interesting consumer path is Apple Silicon: a 128 GB unified-memory Mac Studio (M3 Ultra) or a 128 GB M5 MacBook Pro should run a lightly quantized Flash build via llama.cpp or MLX once the community quantizations land. Simon Willison wrote on release day that he expected Unsloth's quantized versions to appear within hours.
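For reference, the MLX path is only a few lines of Python once a quantized build exists. A sketch of the workflow via the mlx-lm package; the repo id is hypothetical, since no community quantization of V4-Flash had landed as of this writing:

```python
from mlx_lm import load, generate  # pip install mlx-lm; Apple Silicon only

# Hypothetical repo id -- substitute whatever Unsloth or another quantizer
# actually publishes for V4-Flash.
model, tokenizer = load("unsloth/DeepSeek-V4-Flash-4bit-mlx")

text = generate(
    model,
    tokenizer,
    prompt="Write a haiku about unified memory.",
    max_tokens=64,
)
print(text)
```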
The benchmarks are competitive. V4-Flash reaches reasoning quality close to V4-Pro when given a larger thinking budget, and DeepSeek's API prices it at $0.28 per million output tokens, which is roughly 1/107th the cost of GPT-5.5. For most workloads where you do not specifically need the absolute frontier of agentic coding, V4-Flash delivers 85–95% of V4-Pro's quality at a fraction of the resource cost, both hosted and self-hosted.
MiniMax M2.7: read the license before you build on it
MiniMax put the M2.7 weights on Hugging Face on April 11, 2026. The model is a sparse 230 billion parameter MoE with 10 billion active parameters, 256 experts, 62 layers, and a 200K context window. On the SWE-Pro benchmark it scored 56.22%, matching GPT-5.3-Codex. On GDPval-AA it achieved an Elo score of 1495, which was the highest among open-weight models until V4-Pro shipped.
The licensing deserves careful attention. The Hugging Face repo displays a "modified-MIT" tag, which sounds permissive. The actual license file tells a different story. Non-commercial use is broadly allowed in an MIT-like fashion. Commercial use, however, requires prior written authorization from MiniMax. This is meaningfully different from the Apache 2.0 terms on Gemma 4 and Qwen3.6, and it is different from the clean MIT license on GLM-5.1 and DeepSeek V4. If you are evaluating M2.7 for a product, a research paper with commercial implications, or anything you intend to ship, you need to go to MiniMax and ask first.
The hardware picture is similar to V4-Flash. Unquantized BF16 weights require 457 GB. The Unsloth Dynamic Q4 GGUF is 108 GB, which fits on a 128 GB unified-memory Mac Studio at roughly 15 tokens per second. Real production inference uses 4x H200-class GPUs minimum.
The "open weights but good luck running them" tier
DeepSeek V4-Pro: the new open-weight king
V4-Pro is a 1.6 trillion parameter Mixture-of-Experts model with 49 billion active per token, released April 23, 2026 under the MIT License. It is the largest open-weight model ever released, surpassing Moonshot AI's Kimi K2.6 (1.1T) and Z.ai's GLM-5.1 (744B). At a 1 million token context, it uses only 27% of the single-token inference FLOPs and 10% of the KV cache memory compared to V3.2.
On DeepSeek's own benchmarks, V4-Pro-Max (the highest reasoning effort mode) leads open-weight competitors on Codeforces (3206 rating), LiveCodeBench (93.5), and Apex Shortlist (90.2). It ties Claude Opus 4.6 on SWE-Bench Verified at 80.6, and beats Opus 4.6 narrowly on Terminal-Bench 2.0. Where it trails is factual knowledge retrieval and the hardest reasoning benchmarks, particularly against Gemini 3.1 Pro. DeepSeek's own paper describes V4 as trailing the frontier by "approximately 3 to 6 months" on those tasks — an unusually honest self-assessment.
The catch for self-hosting is that V4-Pro is heavier than Llama 4 Maverick, GLM-5.1, or anything else in the open-weight ecosystem. ~865 GB on disk in mixed precision. 8x H100 80 GB GPUs minimum. DGX H100 or 8x H200 recommended for production. For nearly every individual reader, V4-Pro is an API model. DeepSeek's hosted API prices it at $1.74/M input (cache miss), $0.145/M input (cache hit), and $3.48/M output, which is roughly one-seventh the output-token cost of Claude Opus 4.7. For a deep dive on benchmarks, pricing math, and the data sovereignty question of using DeepSeek's Chinese-hosted API, see our DeepSeek V4 release coverage.
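Because cache hits are priced at under a tenth of cache misses, your effective V4-Pro cost depends heavily on workload shape. A quick sketch using the rates above; the workload figures are invented purely for illustration:

```python
# DeepSeek's published V4-Pro rates, in dollars per million tokens.
PRICE_IN_MISS = 1.74
PRICE_IN_HIT = 0.145
PRICE_OUT = 3.48

def monthly_cost(input_mtok: float, output_mtok: float, cache_hit_rate: float) -> float:
    """input_mtok / output_mtok are millions of tokens per month."""
    input_cost = input_mtok * (
        cache_hit_rate * PRICE_IN_HIT + (1 - cache_hit_rate) * PRICE_IN_MISS
    )
    return input_cost + output_mtok * PRICE_OUT

# Illustrative workload: 200M input tokens/month at a 60% cache hit rate,
# plus 40M output tokens/month.
print(f"${monthly_cost(200, 40, 0.60):,.2f}/month")  # -> $295.80/month
```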
GLM-5.1: the current open-weight coding leader on SWE-Bench Pro
Z.ai published the GLM-5.1 open weights on April 7, 2026 under the MIT license. The architecture is a 744 billion parameter Mixture-of-Experts with 40 billion active parameters per token and a 200K context window, and it can generate up to 128K output tokens in a single response. On SWE-Bench Pro it scored 58.4, making it the #1 open-weight model on that specific benchmark and nudging past GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). It was trained entirely on Huawei Ascend chips, with no NVIDIA hardware involved.
The catch is that none of those capabilities are available to you on consumer hardware. The FP8 quantized version still requires 8-way tensor parallelism across enterprise GPUs. Most readers interested in GLM-5.1 will end up accessing it through Z.ai's API (approximately $1.00 per million input tokens and $3.20 per million output) or through the GLM Coding Plan subscription that starts at $10/month.
Llama 4 Maverick: for teams already standardized on Llama
Llama 4 Maverick (400 billion total / 17 billion active, 128 experts, 1M context) targets the GPT-4o tier. It requires a multi-H100 host and carries the same Llama 4 Community License restrictions as Scout. Realistic via API, not at home. With V4-Pro now claiming the open-weight throne and Maverick approaching its first anniversary, Maverick remains useful primarily for organizations that have already standardized on Llama tooling.
The hardware map: what you actually need to buy
Before you shop, one rule. The single most important specification for local inference is memory, not CPU speed, not storage speed, not even GPU FLOPs. A model that does not fit in RAM or VRAM swaps to disk and drops from 30+ tokens per second to 3-5. That is the difference between "this is a useful tool" and "this is a coffee break." Spend on memory first.
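To make "spend on memory first" concrete, here is a crude fit check. The bytes-per-parameter constant is our planning assumption for Q4-style quants (the 20.9 GB Qwen3.6-35B build works out to roughly 0.6 bytes per parameter), not a spec, so confirm against the actual file size before buying anything:

```python
# Rough planning assumptions, not specs. Q4_K-style quants tend to land
# around 0.55-0.6 bytes/parameter once quantization scales and the layers
# kept at higher precision are counted.
BYTES_PER_PARAM_Q4 = 0.6
HEADROOM_GB = 6  # OS + context window + runtime, mid-range estimate

def fit_check(total_params_billions: float, memory_gb: float) -> None:
    weights_gb = total_params_billions * BYTES_PER_PARAM_Q4  # B params x bytes/param = GB
    needed = weights_gb + HEADROOM_GB
    verdict = "fits" if needed <= memory_gb else "does not fit"
    print(f"{total_params_billions:.0f}B at Q4: ~{needed:.0f} GB needed -> {verdict} in {memory_gb:.0f} GB")

fit_check(26, 32)    # Gemma 4 26B MoE on a 32 GB machine
fit_check(35, 32)    # Qwen3.6-35B-A3B on a 32 GB machine
fit_check(284, 128)  # V4-Flash at plain Q4: does not fit, which is why the
                     # 128 GB Mac path needs more aggressive quantization
```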
Entry tier (roughly $80 to $300). A Raspberry Pi 5 with a proper heatsink and an NVMe SSD will run the Gemma 4 E2B model at usable speed. The CanaKit Raspberry Pi 5 Starter Kit PRO (8 GB, 128 GB storage) is the cleanest entry point. Expect 10-20 tokens per second on E2B, and light general chat capability. Not a coding workstation, but a genuine always-on local AI for your home network.
Mainstream tier (roughly $300 to $800). This is the Gemma 4 26B MoE sweet spot. A mini PC with 32 GB of DDR5 RAM runs the MoE variant at conversational speed without a dedicated GPU. Options from our inventory that we actually recommend: the Beelink SER8 (Ryzen 7 8745HS, 24 GB DDR5, 1 TB) or the MINISFORUM UM880 Plus (Ryzen 7 8845HS, 16 GB DDR5), which is upgradeable to 64 GB. If you can stretch to 32 GB factory, the MINISFORUM UM880 Plus 32 GB / 1 TB is a better long-term pick.
Enthusiast tier (roughly $700 to $1,500). This is the Qwen3.6-35B-A3B and Gemma 4 31B tier. Two clean paths. On the NVIDIA side, a used RTX 3090 with 24 GB of VRAM remains the best dollar-per-VRAM value for local AI in 2026. Options include the NVIDIA RTX 3090 Founders Edition 24 GB (renewed), the ASUS ROG Strix RTX 3090 OC 24 GB (renewed), or the MSI RTX 3090 Suprim X 24 GB (renewed). On the Apple side, a Mac Mini M4 with 32 GB of unified memory handles Qwen3.6 Q4 at roughly 70-80 tokens per second once Ollama 0.19's MLX acceleration is active. (Apple affiliate program applications are pending for us, so for Mac pricing, head directly to apple.com.)
Power user tier (roughly $1,500 to $3,000+). For running 70B-class dense models or the borderline-tier MoE models (V4-Flash, MiniMax M2.7) at Q4, you need 64 GB of RAM minimum, and 128 GB if you want to fit V4-Flash or M2.7 onto a single machine. The MINISFORUM AI X1 Pro-370 (Ryzen AI 9 HX370, 32 GB DDR5, 1 TB, 890M, WiFi 7, OCuLink) is our preferred Linux-side pick at the lower end of this range. For maximum local headroom, the MINISFORUM X1 Pro 370 with 64 GB RAM / 1 TB lets you comfortably serve multiple models concurrently. For NVIDIA builds, the RTX 4090 or 5090 extends what you can run, but availability is spotty and prices are still inflated. A 128 GB Mac Studio is the cleanest path for V4-Flash or MiniMax M2.7 specifically — and is the only realistic single-machine path to frontier-class open weights as of this writing.
Storage matters more than you think. A 20 GB model download is fine. Five 20 GB model downloads plus quants plus cached context is not. And once V4-Flash arrives at ~160 GB, a single model file fills a budget SSD on its own. The Samsung 990 EVO Plus 2 TB NVMe is the best value for most setups, and the Samsung 990 EVO Plus 4 TB NVMe makes sense if you plan to keep a library of large quantized models on disk.
For a more detailed breakdown of exactly which mini PC or GPU to pick for your situation, we have our full local AI hardware guide and a focused roundup of the best mini PCs for running Ollama locally.
Privacy and security: local is not automatic
Running a model on your own hardware is the necessary first step for digital sovereignty. It is not the sufficient step. A local LLM stack that binds to 0.0.0.0 and sits on your main LAN with no isolation is not meaningfully more private than a cloud service. Anything on your network can talk to it. Anything it talks to can influence it.
The baseline hygiene: keep Ollama on its default localhost binding unless you have a specific reason to expose it. Run inference inside a Docker container when possible. Put AI workloads on their own VLAN if your router supports segmentation. Monitor outbound DNS with Pi-hole or a similar network-level tool so you see it immediately if a plugin starts phoning home to an unfamiliar domain. Only pull model weights from verified sources on Hugging Face, and treat community-uploaded quantizations with the same caution you would treat any software installation.
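One habit worth singling out from that list: verify checksums before loading community weights. Hugging Face shows a SHA-256 hash on each large file's page, and a short script can check a 100+ GB GGUF without holding it in memory:

```python
import hashlib
import sys

def sha256_of(path: str, chunk_mb: int = 16) -> str:
    """Stream the file in chunks so memory use stays flat regardless of size."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_mb * 1024 * 1024):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # Usage: python verify.py model.gguf <expected-sha256-from-hugging-face>
    path, expected = sys.argv[1], sys.argv[2].lower()
    actual = sha256_of(path)
    print("OK" if actual == expected else f"MISMATCH: got {actual}")
```

A mismatch does not automatically mean tampering; a truncated download is the more common cause. Either way, do not load the file until it matches.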
If you are building something more ambitious on top of your local models, such as a personal knowledge base, the patterns matter more. We walk through the full setup in our guide to building a private LLM-maintained knowledge base with Obsidian and Ollama, including the network isolation considerations for making it actually private.
So which one should you pick?
The default for most readers: Gemma 4 26B MoE. Apache 2.0 license, native multimodality, runs in 16 GB of RAM at Q4, supported on day one in every major local inference stack. If you are new to local AI and you want one recommendation, this is it.
If you have 32 GB of unified memory or a 24 GB GPU: Qwen3.6-35B-A3B. Higher ceiling on agentic coding and long-context work than Gemma 4 31B, and the Unsloth Q4 build is genuinely runnable on a well-specced laptop.
If you have a 128 GB Mac Studio and want frontier-class open weights at home: DeepSeek V4-Flash. This is the new ceiling for borderline-feasible self-hosting, and the MIT license means no commercial-use restrictions to navigate. Wait for Unsloth's quantized GGUF releases, which typically arrive within days of a major launch.
If you want to run AI on a Raspberry Pi: Gemma 4 E2B or E4B. Multimodal, audio-capable on the edge variants, Apache 2.0.
If you want maximum capability and are fine with hosted access: DeepSeek V4-Pro via API. ~1/7 the output cost of Claude Opus 4.7, frontier-adjacent benchmarks, MIT-licensed weights if you change your mind on hosting later. Read our V4 release coverage for the data sovereignty considerations of routing prompts through DeepSeek's Chinese-hosted infrastructure.
If you only care about coding and want hosted access: GLM-5.1 via the Z.ai API or OpenRouter. Still the leader on SWE-Bench Pro specifically, and the pricing is an order of magnitude below the proprietary alternatives.
What we do not recommend without research. MiniMax M2.7 for commercial products, unless you have written authorization from MiniMax. Llama 4 for EU-based readers or any organization approaching 700 million monthly active users. Both have licensing constraints that are easy to miss if you skim headlines.
Frequently Asked Questions
What is the difference between "open source" and "open weights" for these LLMs?
Open weights means the trained model parameters are publicly downloadable. Open source, strictly speaking, means the training code, data, and weights are all released under a permissive license that allows modification and redistribution. Most models in this article are more accurately described as open weights. Apache 2.0 and MIT models (Gemma 4, Qwen3.6, GLM-5.1, DeepSeek V4) come closest to full open source in practice because the licenses impose few restrictions on use. Llama 4's Community License and MiniMax M2.7's modified MIT are open weights with meaningful commercial restrictions.
Can I legally use MiniMax M2.7 or Llama 4 for a commercial product?
MiniMax M2.7's license requires prior written authorization from MiniMax for commercial use, regardless of how you acquired the weights. You need to contact MiniMax before shipping. Llama 4 (Scout and Maverick) is governed by the Llama 4 Community License, which permits commercial use unless your organization exceeds 700 million monthly active users, at which point you must seek separate terms from Meta. The Llama 4 license also restricts EU-based entities from accepting the terms, which limits European commercial deployment. DeepSeek V4 (both Pro and Flash) is the cleanest option for commercial use among the larger models, with a standard MIT license.
Do I need a GPU to run a local LLM in 2026?
No. Apple Silicon Macs with 16 GB or more of unified memory run quantized 7B-8B models at conversational speed, and with 32 GB of unified memory they handle Qwen3.6-35B-A3B and Gemma 4 31B at Q4. On the PC side, modern AMD mini PCs with 32 GB of DDR5 RAM run Gemma 4 26B MoE acceptably without any discrete GPU. A dedicated GPU still matters for sustained high-throughput inference or for running models above 30 billion parameters at higher precision, but it is no longer a requirement to get started.
How much RAM do I actually need for a quantized 30B-class model?
Plan for roughly 20 GB of RAM or VRAM for the model weights at Q4 quantization, plus 4-8 GB of headroom for the operating system, context window, and inference runtime overhead. A machine with 32 GB of total memory is the comfortable minimum. If you plan to run multiple models or keep long context windows active, 64 GB gives you meaningful headroom. For the borderline tier (V4-Flash, MiniMax M2.7), a 128 GB unified-memory Mac is the entry point.
Is Qwen3.6-35B-A3B actually better than Gemma 4 31B?
It depends on the task. Qwen3.6-35B-A3B is notably stronger on agentic coding benchmarks such as Terminal-Bench 2.0 (51.5 vs 42.9) and handles longer contexts more gracefully (262K native, extensible to ~1M, vs Gemma 4's 256K). Gemma 4 31B has a stronger ecosystem around it on day one, is better integrated with Google's tooling, and scores higher on a few reasoning benchmarks such as GPQA Diamond. For coding and agentic work, Qwen3.6 is the stronger pick. For general reasoning and ecosystem support, Gemma 4 is safer.
What is the fastest way to try these models before buying hardware?
For Gemma 4 and Qwen3.6-35B-A3B, use Google AI Studio and Qwen Studio respectively. Both offer free web access to the flagship variants with rate limits. For DeepSeek V4, chat.deepseek.com offers free access to both Pro (Expert Mode) and Flash (Instant Mode) with rate limits. For a local trial on a machine you already own, install LM Studio or Ollama and pull the quantized build. On a Mac with 16 GB or more of unified memory, you can have Gemma 4 E4B responding in under five minutes from a clean install.
Now that DeepSeek V4 has shipped, does it change any of these recommendations?
For the local-inference tier, no. Gemma 4 26B MoE remains the default for most readers, and Qwen3.6-35B-A3B remains the better option for those with more memory. V4-Pro is too large to run on consumer hardware, and V4-Flash sits on the same borderline as MiniMax M2.7 — usable only if you have a 128 GB Mac Studio or a multi-GPU rig. For the hosted-API tier, V4 is genuinely transformative: V4-Flash at $0.28 per million output tokens is roughly 1/107 the cost of GPT-5.5, and V4-Pro is roughly 1/7 the cost of Claude Opus 4.7. For developers who want frontier-adjacent quality at an order of magnitude lower cost, V4 is now the obvious starting point. Our DeepSeek V4 release coverage walks through the benchmarks, pricing math, and data sovereignty considerations in detail.

