The Right Model for the Memory You Own (June 2026)
Last updated: June 2026
The bottleneck for local AI is no longer intelligence. It is memory. The past six weeks alone delivered Gemma 4 12B, Step 3.7 Flash, LFM2.5-8B-A1B, and JetBrains Mellum2, and every one of them targets a different slice of the hardware people actually own. Tier lists ranking "the best model for your GPU" now circulate on social media weekly, and they are genuinely useful, but they share a flaw: nobody checks the claims.
We did. Every model in this guide was verified in June 2026 against its Hugging Face repository, its license text, and its tooling support. Benchmark numbers are labeled by who ran them. One model from this week's viral lists could not be verified at all, and you will find it conspicuously absent below. If you have not picked your hardware yet, start with our local AI hardware guide, then come back here for the software side.
- The best verified picks by tier: LFM2.5-8B-A1B for 8–12GB, Gemma 4 12B for 16GB, the Qwen3.6-27B family for 24–32GB, and Step 3.7 Flash for 128GB+ unified-memory machines.
- Roughly 4-bit (Q4) quantization is what makes the math work: a dense model needs about 0.6 GB of memory per billion parameters, plus headroom for context.
- Self-reported benchmarks and community fine-tunes require vetting before you trust them. We include the checklist, and a real example of a recommendation that failed it.
VRAM, Unified Memory, and System RAM: How to Read the Tiers
Three kinds of memory can hold a model, and they are not interchangeable. Dedicated GPU VRAM is the fastest and the most expensive per gigabyte. Unified memory on Apple Silicon and AMD's Ryzen AI Max platforms trades some speed for enormous capacity on a single machine. Plain system RAM works too, especially for sparse Mixture-of-Experts (MoE) models that only activate a fraction of their weights per token, but dense models run slowly on CPU alone.
Quantization is the other half of the math. Models ship trained at 16-bit precision, but compressing the weights to roughly 4 bits (the common Q4 formats) cuts memory use by about 75% with only a minor quality loss for most tasks. The working rule of thumb: a dense model at Q4 needs about 0.6 GB per billion parameters, plus a few extra gigabytes for context. A 27B model lands around 16GB; a 12B model around 7.5GB. For MoE models, total parameters set the memory bill while active parameters set the speed. If you want the deeper mechanics of how modern compression works, our TurboQuant explainer covers it.
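The rule of thumb above is easy to turn into a quick fit check. The sketch below is only that: the function names and the 3 GB default context allowance are our own assumptions for illustration, not measured values. For MoE models, pass the total parameter count, since every expert has to sit in memory.

```python
GB_PER_BILLION_Q4 = 0.6  # rule of thumb from the text above


def q4_weights_gb(total_params_billion: float) -> float:
    """Approximate weight footprint at roughly 4-bit quantization."""
    return total_params_billion * GB_PER_BILLION_Q4


def fits(total_params_billion: float, memory_gb: float, context_gb: float = 3.0) -> bool:
    """True if weights plus a context allowance fit in the given memory.

    context_gb is an assumed allowance; long contexts need more.
    """
    return q4_weights_gb(total_params_billion) + context_gb <= memory_gb


if __name__ == "__main__":
    for name, params in [("27B dense", 27), ("12B dense", 12)]:
        print(f"{name}: ~{q4_weights_gb(params):.1f} GB at Q4")
    print("27B on a 24GB card:", fits(27, 24))
    print("27B on a 16GB card:", fits(27, 16))
```

The bare rule gives about 16.2 GB for a 27B model and 7.2 GB for a 12B model, in line with the rounded figures quoted above. Treat the result as a first filter, then confirm against the actual file size of the quant you plan to download.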
| Tier | Typical hardware | What fits at Q4 |
|---|---|---|
| 8–12 GB | RTX 3060 12GB, older GPUs | Up to ~14B dense; small MoE |
| 16 GB | RTX 4060 Ti 16GB, 16GB laptops | Up to ~20B dense with context room |
| 24–32 GB | RTX 3090/4090, RTX 5090, 32GB minis | ~27–35B dense; mid MoE |
| 48–96 GB | 64GB mini PCs, dual GPUs, 64–96GB Macs | ~70B dense; 35B+ MoE at high quality |
| 128–256 GB | Mac Studio, Ryzen AI Max+ 395, DGX Spark | ~200B-class MoE |
| 384 GB+ | 512GB Macs, multi-GPU rigs, servers | 400–750B frontier-class MoE |
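If you would rather look up your tier than read the table, here is a minimal sketch that maps a memory size to a row. The tier floors and descriptions are copied from the table above; the function name is hypothetical, and rounding an in-between size (say, 20GB) down to the tier below is our assumption, chosen to stay conservative.

```python
from bisect import bisect_right

# (minimum memory in GB, tier label, what fits at Q4), copied from the table above
TIERS = [
    (8, "8–12 GB", "Up to ~14B dense; small MoE"),
    (16, "16 GB", "Up to ~20B dense with context room"),
    (24, "24–32 GB", "~27–35B dense; mid MoE"),
    (48, "48–96 GB", "~70B dense; 35B+ MoE at high quality"),
    (128, "128–256 GB", "~200B-class MoE"),
    (384, "384 GB+", "400–750B frontier-class MoE"),
]
FLOORS = [floor for floor, _, _ in TIERS]


def tier_for(memory_gb: float):
    """Return (label, what_fits) for the highest tier whose floor is met, else None."""
    index = bisect_right(FLOORS, memory_gb) - 1
    if index < 0:
        return None  # below the smallest tier covered by this guide
    _, label, what_fits = TIERS[index]
    return label, what_fits


if __name__ == "__main__":
    for gb in (6, 12, 20, 64, 512):
        print(gb, "GB ->", tier_for(gb))
```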
Tier 1: 8–12GB VRAM (Entry GPUs and Older Cards)
The pick here is LFM2.5-8B-A1B from Liquid AI, released May 28. It is a sparse MoE with 8.3 billion total parameters but only 1.5 billion active per token, which makes it unusually fast on modest hardware, and it carries a 128K context window with built-in step-by-step reasoning. At Q4 it occupies roughly 5GB, leaving real context headroom even on an 8GB card.
One flag before you build anything on it: this is not plain Apache 2.0. Liquid AI's LFM Open License v1.0 is Apache-based but limits free commercial use to companies under $10 million in annual revenue; above that line, you need a paid license. For personal use, homelabs, and small projects it changes nothing, but it is the kind of clause worth knowing exists. For multimodal work at this size, Google's Gemma 4 E4B edge model is the Apache 2.0 alternative.
Realistic expectations: chat, summarization, document Q&A, and light agent work run well at this tier. The hardware anchor is the 12GB RTX 3060 class, still the cheapest reliable on-ramp to GPU-accelerated local AI.
Tier 2: 16GB (The New Sweet Spot)
This is the most interesting tier of 2026 so far, and the headline pick is Gemma 4 12B Unified, released by Google on June 3. It is a dense model that handles text, images, audio, and video in a single encoder-free architecture, carries a 256K context window, and is positioned by Google explicitly as a laptop-class model for 16GB machines. At Q4 it needs roughly 7.5GB, so even long contexts fit. Day-one support landed in Ollama, LM Studio, and llama.cpp.
The license is its own story. Gemma 4 is released under Apache 2.0, a meaningful shift from the restrictive "Gemma Terms of Use" that governed earlier generations and gave legal teams pause. An open-weight model you can use, modify, and ship commercially without a usage agreement is exactly the direction this site wants the industry moving.
For coding specifically, Mellum2 from JetBrains is worth a look: a 12B MoE (2.5B active) under Apache 2.0 that scores 69.9% on LiveCodeBench, the strongest result in its size class, per JetBrains' own technical report. Two honest caveats: independent confirmation of that number is still pending, and early community reports flag rough Ollama compatibility with its custom MoE architecture, so vLLM is currently the supported path.
Tier 3: 24–32GB (Used 3090s, 4090s, and the 5090)
The safe pick is the base everyone keeps fine-tuning for a reason: Qwen3.6-27B, Alibaba's current dense 27B. At Q4 it sits around 16GB, fitting a 24GB card with comfortable context, and GGUF builds run everywhere. For general reasoning, coding help, and agent work on owned hardware, it is the current default at this size.
The community option, with eyes open, is Qwopus3.6-27B-v2, a fine-tune of that same base aimed at cleaner structured reasoning. Credit where due: its author publishes evaluation logs openly. But read them carefully. The headline MMLU-Pro result comes from a 350-question sample, and its SWE-bench run used a 202-task slice; its training data also includes reasoning traces reconstructed from commercial models' outputs. None of that makes it bad, and it is genuinely worth testing since GGUF builds drop straight into Ollama. It does make it exactly the kind of model you evaluate on your own tasks before trusting.
If your 32GB is system RAM rather than VRAM, on an iGPU mini PC for example, Liquid AI's LFM2-24B-A2B was explicitly designed to fit 32GB of RAM, with the same $10M-revenue license flag noted above. The hardware anchor for this tier remains the renewed RTX 3090: 24GB of VRAM at a fraction of new-card pricing, and still the value king of local AI. Our mini PC guide covers the small-form-factor route.
Tier 4: 48–96GB (64GB Minis, Dual GPUs, Mid-Range Macs)
The pick is Nex-N2-mini, a post-train of Qwen3.5-35B-A3B from Nex AGI, released under Apache 2.0. It technically squeezes onto a 24GB card at a tight Q4 (~20GB), but it belongs in this tier, where higher-precision quants and long agentic contexts have room to breathe. The lab is legitimate, with a published research paper behind its training method, and its agent-focused design shows in tool-calling work.
Two flags. The benchmark numbers, including the strong Terminal-Bench and GDPval scores on the larger sibling, are self-reported on the lab's own evaluation suite. And Nex recommends serving through its customized sglang fork, which means more setup friction than the one-command Ollama experience; budget an evening, not ten minutes. The smoother-running alternatives at this tier are Gemma 4's larger 26B-A4B and 31B variants with generous context, or a 70B-class dense model at Q4 if raw knowledge depth matters more than speed.
The hardware anchor is the 64GB unified-memory mini PC class, which has quietly become the best dollars-per-capability play for local agents.
Tier 5: 128–256GB (Mac Studio Class and Unified-Memory Workstations)
This is unified memory's moment, and the pick is Step 3.7 Flash from StepFun, released in late May. It is a 198B-parameter sparse MoE that activates only about 11B per token, adds a 1.8B vision encoder for native image and document understanding, carries a 256K context window, and ships under clean Apache 2.0 with official GGUF builds and llama.cpp support. At Q4 it lands around 110–120GB, which is exactly why StepFun itself lists the Mac Studio and AMD's Ryzen AI Max+ 395 machines as local deployment targets. Because only 11B parameters fire per token, it runs at interactive speeds despite its size.
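To see why total parameters set the memory bill while active parameters set the speed, here is a rough back-of-the-envelope sketch. It assumes decoding is limited by memory bandwidth, so each generated token reads roughly the active weights once. The 400 GB/s bandwidth is an illustrative placeholder, not a spec for any machine, and the function names are our own. Real throughput lands below this ceiling once context reads, routing, and compute overhead are counted.

```python
GB_PER_BILLION_Q4 = 0.6  # rule of thumb used throughout this guide


def memory_gb(total_params_billion: float) -> float:
    """Memory follows total parameters: every expert must be resident."""
    return total_params_billion * GB_PER_BILLION_Q4


def decode_ceiling_tok_s(active_params_billion: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens/second for bandwidth-bound decoding.

    Ceiling = bandwidth divided by the size of the weights read per token.
    """
    return bandwidth_gb_s / (active_params_billion * GB_PER_BILLION_Q4)


ASSUMED_BANDWIDTH_GB_S = 400.0  # illustrative placeholder, not a real machine's spec

moe = decode_ceiling_tok_s(11, ASSUMED_BANDWIDTH_GB_S)
dense = decode_ceiling_tok_s(198, ASSUMED_BANDWIDTH_GB_S)
print(f"memory needed: ~{memory_gb(198):.0f} GB")
print(f"ceiling with 11B active: ~{moe:.0f} tok/s")
print(f"ceiling if all 198B were active: ~{dense:.1f} tok/s")
print(f"speed ratio: {moe / dense:.0f}x")
```

Whatever bandwidth you plug in, the ratio between the two ceilings stays at 198/11, or 18x, which is the whole argument for sparse models on high-capacity, moderate-bandwidth machines.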
Flags, as always: the launch benchmarks, including a reported 56.3% on SWE-Bench Pro that ranks among the top open-weight results, are vendor-stated, and some early users report tool-calling regressions compared with the prior 3.5 version. Verify against your own agent workloads before standardizing on it. For this tier's hardware, Apple's Mac Studio remains the turnkey option, with the AI Max+ 395 mini-workstations as the PC-side counterpart.
Tier 6: 384GB+ (Frontier at Home, Honestly a Niche)
Let's be straight about who this tier is for: homelab enthusiasts, small teams with privacy requirements, and businesses that have done the math on API spend. This is multi-GPU-rig and 512GB-Mac territory, not a casual purchase.
Two verified picks. GLM-5.1 from Z.ai is the open-weight flagship of the moment: 744B total parameters (40B active) under a clean MIT license, with state-of-the-art open-model results on SWE-Bench Pro and community GGUF builds available. We covered its benchmark story in depth in our GLM-5.1 analysis. At Q4 it wants roughly 410–420GB. The second is Nex-N2-Pro, the 397B (17B active) Apache 2.0 sibling of the tier-4 pick, whose self-reported scores reach frontier territory on terminal and knowledge-work benchmarks; a high-precision quant runs about 415GB, with Q4 closer to 220GB.
One omission worth explaining: a community post-train of GLM-5.1 made the rounds in this week's tier lists with a claim of winning eight benchmarks. We could not verify the repository or the claim against any public source, so it is not in this guide. Which brings us to the section every local AI user needs.
How to Vet a Community Fine-Tune Before You Run It
Hugging Face hosts hundreds of new fine-tunes weekly, ranging from genuinely excellent to benchmark-gamed to unverifiable. Model weights are passive data files, not executable code, so the risk is rarely malware. The risk is trusting capability claims that were never real, then building on them. Before you download, run this checklist:
- Confirm the base model and its license. A fine-tune inherits its base model's license terms. If the card does not clearly state the base, stop there.
- Check who ran the benchmarks and on how many samples. A score from a 350-question slice of MMLU-Pro is a signal, not a result. Vendor and author numbers are marketing until independently reproduced.
- Look for published evaluation logs and training-data disclosure. Authors who show their work, as the Qwopus project does, deserve more trust than authors who show only a leaderboard screenshot.
- Check tooling support. GGUF builds that load in Ollama or LM Studio mean low friction. A custom serving fork means you are the QA department.
- Weight independent results over self-reported ones. If a model claims to beat frontier systems and no third party has confirmed it after weeks, that silence is data.
- If you cannot verify it exists as described, skip it. That is not paranoia. That is the same discipline you would apply to firmware from an unknown source.
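For anyone who vets models regularly, the checklist above can be sketched as a small script. Everything here is hypothetical scaffolding: the field names, the verdict strings, and especially the 500-sample cutoff are assumptions for illustration, and you fill in the facts by hand after reading the model card.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SMALL_SAMPLE = 500  # assumed cutoff for "a signal, not a result"; pick your own


@dataclass
class CardFacts:
    """Hypothetical record of what you found on a model's page."""
    repository_verified: bool
    base_model_stated: bool
    license_confirmed: bool
    benchmark_samples: Optional[int]  # None if the sample size is undisclosed
    eval_logs_published: bool
    training_data_disclosed: bool
    loads_in_standard_tools: bool     # GGUF builds for Ollama or LM Studio
    independently_reproduced: bool


def vet(card: CardFacts) -> Tuple[str, List[str]]:
    """Return a verdict plus the reasons behind it."""
    blockers = []
    if not card.repository_verified:
        blockers.append("cannot verify it exists as described")
    if not card.base_model_stated:
        blockers.append("base model not clearly stated")
    if blockers:
        return "skip", blockers  # hard stops from the checklist

    cautions = []
    if not card.license_confirmed:
        cautions.append("base model license not confirmed")
    if card.benchmark_samples is None or card.benchmark_samples < SMALL_SAMPLE:
        cautions.append("benchmark sample size small or undisclosed")
    if not card.eval_logs_published:
        cautions.append("no published evaluation logs")
    if not card.training_data_disclosed:
        cautions.append("training data not disclosed")
    if not card.loads_in_standard_tools:
        cautions.append("needs a custom serving stack")
    if not card.independently_reproduced:
        cautions.append("scores not independently reproduced")
    verdict = "test on your own tasks first" if cautions else "passes"
    return verdict, cautions


# A made-up card: transparent author, small benchmark slice, no third-party results
example = CardFacts(
    repository_verified=True, base_model_stated=True, license_confirmed=True,
    benchmark_samples=350, eval_logs_published=True, training_data_disclosed=True,
    loads_in_standard_tools=True, independently_reproduced=False,
)
print(vet(example))
```

The split between blockers and cautions mirrors the checklist: an unstated base model or an unverifiable repository ends the review, while everything else lowers trust without ruling the model out.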
Once a model passes, our zero-cost local agent stack guide covers turning it into something useful.
Where the Closed Frontier Fits
Nothing in this guide matches Claude Fable 5 or its closed-frontier peers on the hardest long-horizon tasks, and pretending otherwise would insult your intelligence. The frontier gap is real. It is also rented: as we covered in our Fable 5 analysis, the most capable public model launched with a subscription access window measured in days and a mandatory 30-day data retention policy.
Every model on this page is the opposite arrangement. The weights live on your disk, the terms cannot change underneath you, no usage meter runs, and nothing you type leaves your network. For a growing share of real work, that trade is no longer a sacrifice. It is just a choice.
All Picks at a Glance
| Model | Params (total → active) | License | ~Size at Q4 | Best tier |
|---|---|---|---|---|
| LFM2.5-8B-A1B | 8.3B → 1.5B | LFM v1.0* | ~5 GB | 8–12 GB |
| Gemma 4 12B | 12B dense | Apache 2.0 | ~7.5 GB | 16 GB |
| Mellum2 | 12B → 2.5B | Apache 2.0 | ~7.5 GB | 16 GB |
| Qwen3.6-27B | 27B dense | Apache 2.0 | ~16 GB | 24–32 GB |
| Nex-N2-mini | 35B → 3B | Apache 2.0 | ~20 GB | 48–96 GB |
| Step 3.7 Flash | 198B → ~11B | Apache 2.0 | ~110–120 GB | 128–256 GB |
| Nex-N2-Pro | 397B → 17B | Apache 2.0 | ~220 GB | 384 GB+ |
| GLM-5.1 | 744B → 40B | MIT | ~410–420 GB | 384 GB+ |
* LFM Open License v1.0: free for personal use and companies under $10M annual revenue; larger businesses require a commercial license from Liquid AI. Q4 sizes are approximations; add several GB for long contexts.
Frequently Asked Questions
What is the best local AI model for 16GB of RAM or VRAM?
Gemma 4 12B Unified is the current standout: dense, multimodal (text, image, audio, video), 256K context, Apache 2.0, and roughly 7.5GB at Q4, leaving real room for context on a 16GB machine. For coding-focused work, JetBrains Mellum2 is the size-class alternative.
How much memory do I need to run a 27B model?
About 16GB at Q4 quantization for the weights, plus 2–6GB for context depending on length. A 24GB GPU runs a 27B model comfortably; a 16GB card can only manage it with more aggressive quantization, and that costs noticeable quality.
What does Q4 quantization actually do to quality?
It compresses weights from 16-bit to roughly 4-bit precision, cutting memory use by about 75%. For chat, summarization, and most coding tasks the quality loss is minor; it becomes more noticeable on precise math and long-chain reasoning. Q5 and Q6 formats split the difference when you have spare memory.
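The arithmetic behind that 75% figure is short enough to check yourself. This sketch uses nominal bit widths and a hypothetical helper name; real Q4 files store extra scaling data and keep some tensors at higher precision, which is why practical estimates run closer to 0.6 GB per billion parameters (about 4.8 bits per weight) than to the bare 4-bit number.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring format overhead."""
    return params_billion * bits_per_weight / 8


# Nominal bit widths for a 27B model; actual quant files come out somewhat larger
for label, bits in [("16-bit", 16), ("Q6", 6), ("Q5", 5), ("Q4", 4)]:
    size = weights_gb(27, bits)
    saving = 1 - bits / 16
    print(f"{label:>6}: {size:5.1f} GB  ({saving:.0%} smaller than 16-bit)")
```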
Are community fine-tunes from Hugging Face safe to download?
Model weights are passive data files, not executable code, so the malware risk from the weights themselves is low. The real risk is unverified capability claims. Check the base model, the license, who ran the benchmarks and on how many samples, and whether independent results exist before relying on one.
Is unified memory better than GPU VRAM for local AI?
It is a trade. Dedicated VRAM is faster per token; unified memory (Apple Silicon, AMD Ryzen AI Max) offers far more capacity per dollar, which is what lets a single desktop run 198B-class MoE models like Step 3.7 Flash. For large sparse models, capacity wins; for small dense models, a GPU is quicker.
Do these models work with Ollama and LM Studio?
Most do: Gemma 4 12B, Qwen3.6-27B, Qwopus, Step 3.7 Flash, and GLM-5.1 all have GGUF builds that load in standard tools. The exceptions are Mellum2, where Ollama support is currently rough and vLLM is the supported path, and the Nex-N2 models, which are best served through the lab's customized sglang stack.

