Best Open-Source LLMs of April 2026 + Hardware Needed

The April 2026 open-source model class — and an honest breakdown of which ones run on consumer hardware and which demand a data center.


Last updated: April 2026

Key Takeaways

  • Four major open-source LLM releases landed in the two weeks ending April 16, 2026: Gemma 4, the GLM-5.1 open weights, the MiniMax M2.7 weights, and today's Qwen3.6-35B-A3B. This is the strongest open-model class we have seen.
  • The class has split into two tiers that headlines consistently blur. Gemma 4 and Qwen3.6-35B-A3B genuinely run on consumer hardware. GLM-5.1, MiniMax M2.7, and Llama 4 Maverick are "open weights" in name but require data-center GPUs to run at full quality.
  • For most readers who want to self-host today, the default pick is Gemma 4 26B MoE (Apache 2.0, runs in 16 GB of RAM). If you have 32 GB of unified memory on an Apple Silicon Mac or a 24 GB VRAM GPU, the new Qwen3.6-35B-A3B is the more capable option.

Why April 2026 matters for open-source AI

On April 2, Google DeepMind released Gemma 4 under a fully permissive Apache 2.0 license. On April 7, Z.ai (formerly Zhipu AI) published the GLM-5.1 weights. On April 11, MiniMax put the M2.7 weights on Hugging Face. And on the morning of April 16, Alibaba's Qwen team shipped Qwen3.6-35B-A3B. Rounding out the field are Meta's Llama 4 models, released April 5, 2025, which remain the established open-weight flagships from last year.

Four significant releases in fourteen days is not normal, even by 2026 standards. What makes the moment worth writing about is not the count. It is that the benchmark gap between the best open models and the best proprietary models has now closed to single digits on the evaluations enterprises actually care about. Simon Willison reported on April 16 that a 20.9 GB quantized Qwen3.6-35B-A3B running on his MacBook Pro produced a better SVG illustration than Anthropic's brand-new Claude Opus 4.7 on his long-running pelican benchmark.

That result is a party trick, but it is also a signal. Open weights that run locally are now genuinely competitive with proprietary flagships on many real tasks. For readers who care about data sovereignty, privacy, and escaping recurring API costs, the question has shifted. It is no longer "is an open model good enough?" It is "which open model fits the hardware I already have, or can reasonably afford?"

The April 2026 open-source landscape at a glance

Here are the nine models we consider the serious contenders this month, along with the hardware reality check most comparison articles skip.

| Model | Released | Params (Total / Active) | License | Context | Multimodal | Runs on consumer hardware? |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B | Apr 16, 2026 | 35B / 3B (MoE) | Apache 2.0 | 262K, extensible to ~1M | Yes | Yes. 20.9 GB at Q4 quant; 32 GB Mac or 24 GB GPU. |
| Gemma 4 31B Dense | Apr 2, 2026 | 31B / 31B | Apache 2.0 | 256K | Yes | Yes. ~20 GB at Q4; 24 GB GPU or 32 GB Mac. |
| Gemma 4 26B MoE | Apr 2, 2026 | 26B / 3.8B (MoE) | Apache 2.0 | 256K | Yes | Yes. ~16 GB RAM at Q4. Best value. |
| Gemma 4 E4B | Apr 2, 2026 | ~4.5B / ~4B | Apache 2.0 | 256K | Yes (incl. audio) | Yes. 8-12 GB RAM. Laptop-class. |
| Gemma 4 E2B | Apr 2, 2026 | ~2.3B / ~2B | Apache 2.0 | 256K | Yes (incl. audio) | Yes. Raspberry Pi 5, phones. |
| GLM-5.1 | Apr 7, 2026 | 744B / 40B (MoE) | MIT | 200K | No | No. ~1.49 TB storage; 8x H200-class GPUs. |
| MiniMax M2.7 | Apr 11, 2026 (weights) | 230B / 10B (MoE) | Modified MIT (non-commercial without authorization) | 200K | No | Barely. 108 GB Q4 fits on 128 GB Mac Studio. |
| Llama 4 Scout | Apr 5, 2025 | 109B / 17B (MoE, 16 experts) | Llama 4 Community (EU restriction, MAU clause) | 10M | Yes | Edge case. 54 GB at int4 on a single H100. |
| Llama 4 Maverick | Apr 5, 2025 | 400B / 17B (MoE, 128 experts) | Llama 4 Community (EU restriction, MAU clause) | 1M | Yes | No. Multi-H100 host required. |

A quick note on what we cut. We did not include Mistral Small 4 (a strong March 2026 release but not the news of the moment), NVIDIA Nemotron 3 Super (useful but narrowly scoped), or OpenAI's gpt-oss family (now clearly trailing the April class). DeepSeek V3.2 is still the publicly available DeepSeek model; V4 was reported by Reuters in early April to be "weeks away" and had not shipped as of April 12. We will update this guide when it does.

The two tiers, and why the distinction matters

When a model card says "open weights," people hear "free." Sometimes that is true. Often it is not. The April 2026 class breaks cleanly into two groups, and conflating them leads to expensive mistakes.

Tier 1: Models that actually run on consumer hardware

Gemma 4 (all four sizes) and Qwen3.6-35B-A3B are genuinely runnable on equipment a person or small team can afford. An E2B variant fits on a Raspberry Pi 5. The 26B MoE Gemma 4 runs in roughly 16 GB of RAM at 4-bit quantization. Qwen3.6-35B-A3B, in the Unsloth Dynamic Q4 build that Simon Willison tested, weighs in at 20.9 GB and runs on a MacBook Pro with 32 GB of unified memory.

This tier is where digital sovereignty actually lives for most readers. Your prompts stay on your machine. Your documents never leave your network. There is no API meter running. And when your ISP has an outage, your AI keeps working.

Tier 2: Open weights that require data-center hardware

GLM-5.1, MiniMax M2.7, Llama 4 Maverick, and DeepSeek V3.2 all have publicly downloadable weights, and all of them are effectively unrunnable at home. GLM-5.1's full BF16 weights weigh approximately 1.49 terabytes and require 8-way tensor parallelism across enterprise GPUs, meaning 8x NVIDIA H200s or equivalent. Llama 4 Maverick needs a multi-node H100 DGX setup for full performance. MiniMax M2.7 at its smallest practical quant is 108 GB and produces roughly 15 tokens per second on a 128 GB unified-memory Mac Studio, which is the fastest consumer path that exists.
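The 1.49 TB figure is not mysterious; it falls straight out of the parameter count. A quick back-of-envelope check (BF16 stores two bytes per parameter; the small remainder is embeddings, norms, and file metadata):

```bash
# BF16 = 2 bytes per parameter, so GLM-5.1's 744B parameters alone come to:
awk 'BEGIN { printf "%.2f TB\n", 744e9 * 2 / 1e12 }'   # ≈ 1.49 TB, before any KV cache
```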

For a deeper dive on the infrastructure side of GLM-5.1 specifically, see our full GLM-5.1 coverage, which walks through why most readers will end up using it through an API or subscription rather than self-hosting.

Model-by-model: the consumer-runnable tier

Qwen3.6-35B-A3B: the new local-inference champion

Alibaba's Qwen team released Qwen3.6-35B-A3B on April 16, 2026, under the Apache 2.0 license. The architecture is sparse Mixture-of-Experts: 35 billion total parameters but only 3 billion active per inference token. It supports a native 262K context window extensible to roughly 1 million tokens, and it is natively multimodal. Thinking and non-thinking modes are both supported.

What makes it notable is not the parameter count but the benchmark efficiency. On Terminal-Bench 2.0 (agentic terminal coding), Qwen3.6-35B-A3B scored 51.5 compared to Gemma 4 31B's 42.9. On SWE-Bench Verified it reached 73.4 against the Gemma 4 31B Dense variant's 75.0. These are competitive numbers from a model activating roughly a tenth of the parameters of the dense models it is measured against.

The hardware story matters here. Simon Willison ran the Unsloth-quantized Q4 build (20.9 GB on disk) on a MacBook Pro M5 via LM Studio the morning of release, and reported generation quality that beat Anthropic's brand-new proprietary flagship, Claude Opus 4.7, on a creative SVG benchmark he has maintained for over a year. That is a single test, and the quality gap on harder reasoning tasks likely still favors Opus. But for a model that fits in a 24 GB VRAM GPU or a 32 GB Mac and streams usable tokens per second, the ceiling is remarkable.

Where to get it: the official Qwen repository on Hugging Face, Unsloth's quantized GGUF builds, or through Ollama once the community tags are published.
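If you prefer the llama.cpp route over Ollama, the flow looks roughly like the sketch below. The repository and file names are placeholders, not confirmed paths; check Unsloth's actual Hugging Face listing for the exact GGUF names before copying anything.

```bash
# Hypothetical repo/file names -- confirm against the real Unsloth listing first
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Serve it with llama.cpp's built-in OpenAI-compatible server
llama-server -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 32768 --port 8080
```

From there, any OpenAI-compatible client can point at http://localhost:8080/v1 on your own machine.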

Gemma 4: the most versatile family on the list

Google DeepMind released the Gemma 4 family on April 2, 2026, in four sizes: E2B (edge, ~2B effective), E4B (mobile-class, ~4B effective), 26B Mixture-of-Experts (3.8B active), and 31B Dense. Every variant is released under the Apache 2.0 license, which is the first time the Gemma family has shipped without a custom license. All four are natively multimodal, and the E2B and E4B edge models additionally handle audio input through an on-device encoder.

The benchmark jumps from Gemma 3 are real and not marginal. The 31B scored 89.2% on AIME 2026 (mathematics), up from 20.8% for Gemma 3 27B. LiveCodeBench v6 went from 29.1% to 80.0%. On the LMArena text leaderboard, Gemma 4 31B currently ranks #3 among open models with an Elo around 1452, and the 26B MoE sits at #6 at 1441 while activating only 3.8 billion parameters per token.

For most readers, the 26B MoE is the sleeper pick. You get quality within a few points of the 31B Dense model at roughly one-eighth the inference compute. The 4-bit quantized version runs in about 16 GB of RAM, which means a reasonably specced mini PC or a MacBook with 24 GB of unified memory handles it comfortably. If you have a single 24 GB NVIDIA GPU such as an RTX 3090 or RTX 4090, you can run the 31B Dense variant at Q4 with headroom.

Install path is as simple as it gets. Gemma 4 ships with day-zero Ollama, llama.cpp, MLX, and LM Studio support. The command ollama run gemma4:26b pulls and runs the MoE variant. If you are new to local inference, this is the lowest-friction entry point in the current class.
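Once the pull finishes, the same model is reachable through Ollama's local HTTP API, which is the hook most local tooling builds on. A minimal check from the terminal (the gemma4:26b tag is the one quoted in this article; verify the published tag in the Ollama library before relying on it):

```bash
ollama run gemma4:26b    # pulls the weights on first run, then drops into a chat REPL

# Or call the local API directly once the model is present
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "Give me a three-sentence summary of mixture-of-experts models.",
  "stream": false
}'
```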

Llama 4 Scout: the long-context specialist

Meta's Llama 4 Scout is not a 2026 release. It shipped on April 5, 2025, alongside its larger sibling Maverick, and remains the current open-weight Llama flagship as of this writing. What keeps it relevant is one specification no other open model matches: a 10 million token context window. Scout has 109 billion total parameters with 17 billion active across 16 experts. With int4 quantization, it fits on a single NVIDIA H100 (roughly 54 GB), which places it at the far edge of what a well-funded enthusiast or small team can reasonably run.

The use case where it wins is entire-codebase retrieval, book-length document synthesis, and long-running agentic workflows that need to maintain state across very long sessions. For general chat, reasoning, or coding, the Apache 2.0 Gemma 4 and Qwen3.6 options are a better fit because of their more permissive licensing. Llama 4 ships under Meta's Community License, which restricts use by entities over 700 million monthly active users and blocks EU-based companies from accepting the license terms outright.

The "open weights but good luck running them" tier

GLM-5.1: the current open-weight coding leader

Z.ai published the GLM-5.1 open weights on April 7, 2026, under the MIT license. The architecture is a 744-billion-parameter Mixture-of-Experts with 40 billion active parameters per token and a 200K context window, and it can generate up to 128K output tokens in a single response. On SWE-Bench Pro it scored 58.4, making it the #1 open-weight model on that benchmark and nudging past GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). It was trained entirely on Huawei Ascend chips, with no NVIDIA hardware involved.

The catch is that none of those capabilities are available to you on consumer hardware. The FP8 quantized version still requires 8-way tensor parallelism across enterprise GPUs. Most readers interested in GLM-5.1 will end up accessing it through Z.ai's API (approximately $1.00 per million input tokens and $3.20 per million output) or through the GLM Coding Plan subscription that starts at $10/month.

MiniMax M2.7: read the license before you build on it

MiniMax put the M2.7 weights on Hugging Face on April 11, 2026. The model is a sparse 230 billion parameter MoE with 10 billion active parameters, 256 experts, 62 layers, and a 200K context window. On the SWE-Pro benchmark it scored 56.22%, matching GPT-5.3-Codex. On GDPval-AA it achieved an ELO of 1495, the highest among open-weight models.

The licensing deserves careful attention. The Hugging Face repo displays a "modified-MIT" tag, which sounds permissive. The actual license file tells a different story. Non-commercial use is broadly allowed in an MIT-like fashion. Commercial use, however, requires prior written authorization from MiniMax. This is meaningfully different from the Apache 2.0 terms on Gemma 4 and Qwen3.6, and it is different from the clean MIT license on GLM-5.1. If you are evaluating M2.7 for a product, a research paper with commercial implications, or anything you intend to ship, you need to go to MiniMax and ask first.

The hardware picture is equally constrained. Unquantized BF16 weights require 457 GB. The Unsloth Dynamic Q4 GGUF is 108 GB, which fits on a 128 GB unified-memory Mac Studio at roughly 15 tokens per second. That is the fastest "desktop" path that exists. Real production inference uses 4x H200-class GPUs minimum.

Llama 4 Maverick and DeepSeek V3.2

Llama 4 Maverick (400 billion total / 17 billion active, 128 experts, 1M context) targets the GPT-4o tier. It requires a multi-H100 host and carries the same Llama 4 Community License restrictions as Scout. DeepSeek V3.2 remains the publicly available DeepSeek model, with approximately 1 trillion total parameters and 32 billion active at inference. V4 has been reported as imminent since February and had not shipped as of April 12, 2026. Both are realistic via API, not at home.

The hardware map: what you actually need to buy

Before you shop, one rule. The single most important specification for local inference is memory, not CPU speed, not storage speed, not even GPU flops. A model that does not fit in RAM or VRAM swaps to disk and drops from 30+ tokens per second to 3-5. That is the difference between "this is a useful tool" and "this is a coffee break." Spend on memory first.
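A rough sizing rule: a Q4-class quantization lands around 4.5 bits per weight once scales and metadata are counted, so you can sanity-check any model against your memory before downloading it. A quick sketch (the 4.5 bits/weight figure is an approximation; real GGUF sizes vary by quant recipe):

```bash
# weights-only footprint at ~4.5 bits/weight (approximation; actual Q4 GGUFs vary)
awk 'BEGIN { printf "35B @ Q4  ~ %.1f GB\n", 35e9 * 4.5 / 8 / 1e9 }'
awk 'BEGIN { printf "26B @ Q4  ~ %.1f GB\n", 26e9 * 4.5 / 8 / 1e9 }'
# then add 4-8 GB for the OS, context window (KV cache), and inference runtime
```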

Entry tier (roughly $80 to $300). A Raspberry Pi 5 with a proper heatsink and an NVMe SSD will run the Gemma 4 E2B model at usable speed. The CanaKit Raspberry Pi 5 Starter Kit PRO (8 GB, 128 GB storage) on Amazon is the cleanest entry point. Expect 10-20 tokens per second on E2B, and light general chat capability. Not a coding workstation, but a genuine always-on local AI for your home network.

Mainstream tier (roughly $300 to $800). This is the Gemma 4 26B MoE sweet spot. A mini PC with 32 GB of DDR5 RAM runs the MoE variant at conversational speed without a dedicated GPU. Options from our inventory that we actually recommend: the Beelink SER8 (Ryzen 7 8745HS, 24 GB DDR5, 1 TB) on Amazon or the MINISFORUM UM880 Plus (Ryzen 7 8845HS, 16 GB DDR5) on Amazon, which is upgradeable to 64 GB. If you can stretch to 32 GB factory, the MINISFORUM UM880 Plus 32 GB / 1 TB on Amazon is a better long-term pick.

Enthusiast tier (roughly $700 to $1,500). This is the Qwen3.6-35B-A3B and Gemma 4 31B tier. Two clean paths. On the NVIDIA side, a used RTX 3090 with 24 GB of VRAM remains the best dollar-per-VRAM value for local AI in 2026. Options include the NVIDIA RTX 3090 Founders Edition 24 GB (renewed) on Amazon, the ASUS ROG Strix RTX 3090 OC 24 GB (renewed) on Amazon, or the MSI RTX 3090 Suprim X 24 GB (renewed) on Amazon. On the Apple side, a Mac Mini M4 with 32 GB of unified memory handles Qwen3.6 Q4 at roughly 70-80 tokens per second once Ollama 0.19's MLX acceleration is active. (Apple affiliate program applications are pending for us, so for Mac pricing, head directly to apple.com.)

Power user tier (roughly $1,500 to $3,000+). For running 70B-class dense models or MiniMax M2.7 at Q4, you need 64 GB of RAM or more and strong memory bandwidth. The MINISFORUM AI X1 Pro-370 (Ryzen AI 9 HX370, 32 GB DDR5, 1 TB, 890M, WiFi 7, OCuLink) on Amazon is our preferred Linux-side pick here. For maximum local headroom, the MINISFORUM X1 Pro 370 with 64 GB RAM / 1 TB on Amazon lets you comfortably serve multiple models concurrently. For NVIDIA builds, the RTX 4090 or 5090 extends what you can run, but availability is rotating and prices are still inflated. A 128 GB Mac Studio is the cleanest path for MiniMax M2.7 Q4 specifically.

Storage matters more than you think. A 20 GB model download is fine. Five 20 GB model downloads plus quants plus cached context is not. The Samsung 990 EVO Plus 2 TB NVMe on Amazon is the best value for most setups, and the Samsung 990 EVO Plus 4 TB NVMe on Amazon makes sense if you plan to keep a library of 7-10 quantized models on disk.

For a more detailed breakdown of exactly which mini PC or GPU to pick for your situation, we have our full local AI hardware guide and a focused roundup of the best mini PCs for running Ollama locally.

Privacy and security: local is not automatic

Running a model on your own hardware is the necessary first step for digital sovereignty. It is not the sufficient step. A local LLM stack that binds to 0.0.0.0 and sits on your main LAN with no isolation is not meaningfully more private than a cloud service. Anything on your network can talk to it. Anything it talks to can influence it.

The baseline hygiene: keep Ollama on its default localhost binding unless you have a specific reason to expose it. Run inference inside a Docker container when possible. Put AI workloads on their own VLAN if your router supports segmentation. Monitor outbound DNS with Pi-hole or a similar network-level tool so you see it immediately if a plugin starts phoning home to an unfamiliar domain. Only pull model weights from verified sources on Hugging Face, and treat community-uploaded quantizations with the same caution you would treat any software installation.
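Concretely, the two highest-leverage steps are keeping the API on loopback and containerizing the runtime. A minimal sketch using Ollama and Docker (OLLAMA_HOST is Ollama's standard bind-address variable; adjust volume names and ports for your own setup):

```bash
# Ollama binds to 127.0.0.1:11434 by default -- only change this if you must
# expose it, and prefer a specific interface over 0.0.0.0
export OLLAMA_HOST=127.0.0.1:11434

# Containerized alternative: publish the port to loopback only, so nothing else
# on the LAN can reach the API
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```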

If you are building something more ambitious on top of your local models, such as a personal knowledge base, the patterns matter more. We walk through the full setup in our guide to building a private LLM-maintained knowledge base with Obsidian and Ollama, including the network isolation considerations for making it actually private.

So which one should you pick?

The default for most readers: Gemma 4 26B MoE. Apache 2.0 license, native multimodality, runs in 16 GB of RAM at Q4, supported on day one in every major local inference stack. If you are new to local AI and you want one recommendation, this is it.

If you have 32 GB of unified memory or a 24 GB GPU: Qwen3.6-35B-A3B. Higher ceiling on agentic coding and long-context work than Gemma 4 31B, and the Unsloth Q4 build is genuinely runnable on a well-specced laptop. New as of today, so expect tooling polish to improve over the next week.

If you want to run AI on a Raspberry Pi: Gemma 4 E2B or E4B. Multimodal, audio-capable on the edge variants, Apache 2.0.

If you only care about coding and are fine with hosted access: GLM-5.1 via the Z.ai API or OpenRouter. It is the current leader on SWE-Bench Pro, and the pricing is an order of magnitude below the proprietary alternatives.

What we do not recommend without research. MiniMax M2.7 for commercial products, unless you have written authorization from MiniMax. Llama 4 for EU-based readers or any organization approaching 700 million monthly active users. Both have licensing constraints that are easy to miss if you skim headlines.

Frequently Asked Questions

What is the difference between "open source" and "open weights" for these LLMs?

Open weights means the trained model parameters are publicly downloadable. Open source strictly speaking means the training code, data, and weights are all released under a permissive license that allows modification and redistribution. Most models in this article are more accurately described as open weights. Apache 2.0 and MIT models (Gemma 4, Qwen3.6, GLM-5.1) come closest to full open source in practice because the licenses impose few restrictions on use. Llama 4's Community License and MiniMax M2.7's modified MIT are open weights with meaningful commercial restrictions.

Can I legally use MiniMax M2.7 or Llama 4 for a commercial product?

MiniMax M2.7's license requires prior written authorization from MiniMax for commercial use, regardless of how you acquired the weights. You need to contact MiniMax before shipping. Llama 4 (Scout and Maverick) is governed by the Llama 4 Community License, which permits commercial use unless your organization exceeds 700 million monthly active users, at which point you must seek separate terms from Meta. The Llama 4 license also restricts EU-based entities from accepting the terms, which limits European commercial deployment.

Do I need a GPU to run a local LLM in 2026?

No. Apple Silicon Macs with 16 GB or more of unified memory run quantized 7B-8B models at conversational speed, and with 32 GB of unified memory they handle Qwen3.6-35B-A3B and Gemma 4 31B at Q4. On the PC side, modern AMD mini PCs with 32 GB of DDR5 RAM run Gemma 4 26B MoE acceptably without any discrete GPU. A dedicated GPU still matters for sustained high-throughput inference or for running models above 30 billion parameters at higher precision, but it is no longer a requirement to get started.

How much RAM do I actually need for a quantized 30B-class model?

Plan for roughly 20 GB of RAM or VRAM for the model weights at Q4 quantization, plus 4-8 GB of headroom for the operating system, context window, and inference runtime overhead. A machine with 32 GB of total memory is the comfortable minimum. If you plan to run multiple models or keep long context windows active, 64 GB gives you meaningful headroom.

Is Qwen3.6-35B-A3B actually better than Gemma 4 31B?

It depends on the task. Qwen3.6-35B-A3B is notably stronger on agentic coding benchmarks such as Terminal-Bench 2.0 (51.5 vs 42.9) and handles longer contexts more gracefully (262K native, extensible to ~1M, vs Gemma 4's 256K). Gemma 4 31B has a stronger ecosystem around it on day one, is better integrated with Google's tooling, and scores higher on a few reasoning benchmarks such as GPQA Diamond. For coding and agentic work, Qwen3.6 is the stronger pick. For general reasoning and ecosystem support, Gemma 4 is safer.

What is the fastest way to try these models before buying hardware?

For Gemma 4 and Qwen3.6-35B-A3B, use Google AI Studio and Qwen Studio respectively. Both offer free web access to the flagship variants with rate limits. For a local trial on a machine you already own, install LM Studio or Ollama and pull the quantized build. On a Mac with 16 GB or more of unified memory, you can have Gemma 4 E4B responding in under five minutes from a clean install.

Will DeepSeek V4 change any of these recommendations when it launches?

Probably yes for the hosted-API tier, probably not for the local-inference tier. V4 is expected to be a roughly 1 trillion parameter MoE with 32 billion active, a 1 million token context window, and pricing in the $0.14 to $0.30 per million input tokens range, which would make it the most aggressive pricing event of Q2 2026 for anyone using hosted inference. The trillion-parameter total, however, puts the model squarely in Tier 2 from a self-hosting perspective. If you want to run something at home, Gemma 4 and Qwen3.6 will continue to be the answer regardless of what DeepSeek ships.

