Every Open-Weight AI Model Worth Running in 2026 — The Complete Cheatsheet
The only open-weight models reference you need. Specs, benchmarks, and hosting requirements for every LLM, image, and audio model worth self-hosting.
The open-weight AI model landscape in 2026 is unrecognizable from even a year ago. Models like Llama 4, DeepSeek V3.2, Qwen 3.5, GLM-5.1, Gemma 4, Kimi K2.5, and Nemotron 3 now compete head-to-head with closed-source flagships — and you can run them on your own hardware. No API fees. No data leaving your infrastructure. No rate limits.
This cheatsheet is the reference I wish I had when evaluating which open-weight models are actually worth the hardware investment. I’ve cut through the noise and included only models that are production-ready, have clear licensing, and deliver real performance. Whether you’re looking for the best open source AI models for self-hosting, evaluating hardware requirements, or comparing licenses for commercial use, everything you need is on this page.
Tip: Looking for closed-source / API models like Claude, GPT, and Gemini instead? See the Frontier AI Models Cheatsheet.
Why Open Weights?
| Benefit | Description |
|---|---|
| Data Privacy | Your data never leaves your infrastructure |
| Cost Control | No per-token API fees after hardware investment |
| Customization | Fine-tune with LoRA/QLoRA for your domain (see the sketch below) |
| No Rate Limits | Scale throughput with your hardware |
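The customization benefit is worth making concrete: a LoRA adapter trains well under 1% of a model's weights. Below is a minimal sketch using Hugging Face PEFT; the base model ID and `target_modules` names are illustrative and vary by architecture, so check each model card.

```python
# Minimal LoRA setup with Hugging Face PEFT. Base model ID and
# target_modules are illustrative; attention projection names differ
# between model families, so consult the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # stand-in; any causal LM works
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16,                                  # adapter rank: quality/VRAM trade-off
    lora_alpha=32,                         # scaling factor, commonly 2x the rank
    target_modules=["q_proj", "v_proj"],   # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of the base weights
```

From here, a standard `Trainer` or TRL `SFTTrainer` loop completes the fine-tune.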
Meta Llama 4
| Model | Version | Total Params | Active | Context | License | Min Hardware |
|---|---|---|---|---|---|---|
| Llama 4 Maverick | llama-4-maverick-17b-128e | 400B | 17B | 1M | Community | 4× H100 |
| Llama 4 Scout | llama-4-scout-17b-16e | 109B | 17B | 10M | Community | 1× H100 (Q4) |
Both models are natively multimodal (text + vision). Maverick with 128 experts beats GPT-4o and Gemini 2.0 Flash across widely reported benchmarks. Scout’s 10M context window remains the largest available among open-weight models.
Tip: Meta announced Muse Spark (April 2026) as their proprietary next-gen model. Llama 4 remains the latest open-weight family.
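A minimal offline-inference sketch with vLLM for Scout. The repo ID follows Meta's Llama 4 naming convention; the single-GPU figure in the table assumes a 4-bit quantized checkpoint, so verify which quantized variants are published before deploying.

```python
# Offline batch inference with vLLM. Swap in a quantized (4-bit/FP8)
# checkpoint to match the table's 1x H100 figure; bf16 weights for a
# 109B model need multiple GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=1,
    max_model_len=131072,  # cap the context; a 10M-token KV cache won't fit in VRAM
)
params = SamplingParams(temperature=0.7, max_tokens=512)
out = llm.generate(["Summarize the trade-offs of sparse MoE models."], params)
print(out[0].outputs[0].text)
```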
DeepSeek
| Model | Version | Total Params | Active | Context | License | Best For |
|---|---|---|---|---|---|---|
| DeepSeek V3.2 | deepseek-v3.2 | 671B | 37B | 128K | MIT | Reasoning-first flagship |
| DeepSeek R1 | deepseek-r1-0528 | 671B | 37B | 128K | MIT | Math & logic reasoning |
V3.2 unifies reasoning and agentic performance via DeepSeek Sparse Attention (DSA) and scalable RL. The high-compute variant, V3.2-Speciale, performs comparably to GPT-5. R1's distilled versions (1.5B–70B) retain its reasoning capabilities on consumer hardware.
Tip: DeepSeek V4 (~1T params, 1M context, native multimodal) is expected in the coming weeks but has not launched publicly as of April 14, 2026.
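The distills are the fastest way to try R1-style reasoning locally. A sketch using the Ollama Python client (`pip install ollama`, with an Ollama server running); verify the exact model tag against Ollama's library first, as tag names change.

```python
# Chat with a distilled DeepSeek R1 via the Ollama Python client.
# The tag below follows Ollama's library naming; confirm with `ollama list`.
import ollama

response = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response["message"]["content"])  # includes the model's <think> trace
```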
Alibaba Qwen 3.5
| Model | Version | Params | Context | License | Min Hardware |
|---|---|---|---|---|---|
| Qwen3-Coder-480B-A35B | qwen3-coder-480b-a35b | 480B (35B active) | 262K (1M w/ Yarn) | Apache 2.0 | 4× H100 80GB |
| Qwen 3.5 397B-A17B | qwen3.5-397b-a17b | 397B (17B active) | 262K | Apache 2.0 | 4× H100 80GB |
| Qwen 3.5 122B-A10B | qwen3.5-122b-a10b | 122B (10B active) | 262K | Apache 2.0 | 2× A100 80GB |
| Qwen 3.5 35B-A3B | qwen3.5-35b-a3b | 35B (3B active) | 262K | Apache 2.0 | RTX 4090 |
| Qwen 3.5 27B | qwen3.5-27b | 27B (dense) | 262K | Apache 2.0 | 1× A100 40GB |
Qwen 3.5 (February 2026) uses a hybrid Gated Delta Networks + sparse MoE architecture. Native vision-language model. Qwen3-Coder-480B-A35B is purpose-built for agentic coding with 7.5T tokens of training data (70% code), setting SOTA among open models on agentic coding benchmarks.
Tip: The 35B-A3B MoE runs on consumer GPUs with only 3B active params. Qwen 3.6 Plus Preview (1M context) is available via API but not yet as open weights.
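A transformers sketch for the consumer-friendly 35B-A3B. The Hugging Face repo ID is a guess based on Qwen's naming scheme, so confirm it on the hub before use.

```python
# Chat-template inference with transformers. The repo ID is hypothetical,
# extrapolated from Qwen's naming convention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Qwen/Qwen3.5-35B-A3B-Instruct"  # hypothetical ID; verify on the hub
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a binary search in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```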
Google Gemma 4
| Model | Version | Params | Context | Modality | Min Hardware |
|---|---|---|---|---|---|
| Gemma 4 31B | gemma-4-31b-it | 31B (dense) | 256K | Text + Vision | 1× A100 40GB |
| Gemma 4 26B-A4B | gemma-4-26b-a4b-it | 26B (4B active) | 256K | Text + Vision | RTX 4090 |
| Gemma 4 E4B | gemma-4-e4b-it | 4B effective | 128K | Text + Vision | RTX 3060 12GB |
| Gemma 4 E2B | gemma-4-e2b-it | 2B effective | 128K | Text + Vision | CPU / Mobile |
Released April 2, 2026 under Apache 2.0 (upgraded from Gemma Terms). Configurable thinking modes for chain-of-thought reasoning. All sizes are multimodal with variable aspect ratio and resolution support. Built-in function-calling support.
Tip: The 26B MoE activates only 4B params per token — excellent for high-throughput reasoning on consumer GPUs. The “E” prefix means “effective” parameters (Per-Layer Embeddings use more memory than the count suggests).
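A multimodal sketch via the transformers pipeline, assuming Gemma 4 keeps the `image-text-to-text` integration Gemma 3 shipped with; the model ID below is a guess from the table's naming.

```python
# Vision-language inference through the transformers pipeline.
# Model ID is hypothetical; the pipeline task and message format follow
# the existing Gemma 3 integration.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text", model="google/gemma-4-26b-a4b-it", device_map="auto"
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "What trend does this chart show?"},
    ],
}]
out = pipe(text=messages, max_new_tokens=256, return_full_text=False)
print(out[0]["generated_text"])
```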
Mistral AI
| Model | Version | Params | Context | License | Best For |
|---|---|---|---|---|---|
| Mistral Large 3 | mistral-large-3 | 675B (41B active) | 128K | Apache 2.0 | Multimodal flagship |
| Devstral 2 | devstral-2-123b | 123B (dense) | 256K | Modified MIT | Agentic coding |
| Mistral Small 4 | mistral-small-4-119b | 119B (6B active) | 256K | Apache 2.0 | Unified all-rounder |
| Devstral Small 2 | devstral-small-2-24b | 24B (dense) | 256K | Apache 2.0 | Local coding |
| Voxtral TTS | voxtral-tts | 4B | — | Apache 2.0 | Text-to-speech |
Mistral Small 4 (March 2026) is a 128-expert MoE that unifies Magistral (reasoning), Pixtral (multimodal), and Devstral (coding) into one model with configurable reasoning effort. Devstral 2 is a fully dense 123B model scoring 72.2% on SWE-bench Verified — all parameters participate in every forward pass.
Tip: Devstral 2’s modified MIT license requires companies with >$20M monthly revenue to obtain a separate commercial license. Devstral Small 2 (Apache 2.0) scores 68% on SWE-bench at only 24B params.
Z.ai (GLM)
| Model | Version | Total Params | Active | Context | License | Min Hardware |
|---|---|---|---|---|---|---|
| GLM-5.1 | glm-5.1 | 754B | 40B (MoE) | 200K | MIT | 8× A100 80GB |
| GLM-5 | glm-5 | 744B | 40B (MoE) | 128K | MIT | 8× A100 80GB |
| GLM-5V-Turbo | glm-5v-turbo | MoE | — | 128K | MIT | 4× A100 80GB |
| GLM-4.7-Flash | glm-4.7-flash | 30B | 3B (MoE) | 128K | MIT | RTX 4090 |
Formerly Zhipu AI / THUDM (Tsinghua University). GLM-5.1 (April 7, 2026) is their latest flagship and can work independently for up to 8 hours in a single agentic run. It integrates DeepSeek Sparse Attention (DSA) for efficient long-context processing and generates up to 128K output tokens per response.
Tip: GLM-4.7-Flash at 30B total / 3B active runs on consumer GPUs. One of the best local deployment options for coding and tool use.
NVIDIA Nemotron 3
| Model | Version | Total Params | Active | Context | License | Min Hardware |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | nemotron-3-super-120b | 120B | 12B (MoE) | 1M | NVIDIA Open | 2× H100 |
| Nemotron 3 Nano | nemotron-3-nano-30b | 31.6B | 3.2B (MoE) | 1M | NVIDIA Open | RTX 4090 |
| Nemotron 3 Nano 4B | nemotron-3-nano-4b | 4B | — | — | NVIDIA Open | Edge / Jetson |
Released March 2026 at GTC. Hybrid Mamba-2-Transformer MoE architecture with Multi-Token Prediction. Mamba-2 layers handle the majority of sequence processing in linear time, making the 1M context window practical. Trained on 25T tokens.
NVIDIA also announced the Nemotron Coalition — a collaboration with Mistral AI, Perplexity, Cursor, LangChain, and others to co-develop open frontier models.
Tip: Nemotron 3 Nano delivers 3.3× higher throughput than Qwen3-30B-A3B on an H200 at comparable quality. The Nano 4B runs on Jetson edge devices.
Moonshot AI (Kimi)
| Model | Version | Total Params | Active | Context | License | Min Hardware |
|---|---|---|---|---|---|---|
| Kimi K2.5 | kimi-k2.5 | 1T | 32B (MoE) | 256K | Modified MIT | 4× H100 80GB |
Released January 27, 2026. 384 experts with 8 selected per token + 1 shared expert. Native multimodal via MoonViT vision encoder (400M params). Agent Swarm technology can orchestrate up to 100 AI sub-agents working in parallel.
Tip: At 1T total params but only 32B active, Kimi K2.5 delivers frontier intelligence at efficient inference costs. Strong at visual coding — generates code from UI designs and video workflows.
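A toy sketch of the fan-out idea behind Agent Swarm: dispatch sub-tasks concurrently against an OpenAI-compatible endpoint (vLLM or similar). The `base_url`, model tag, and task split are placeholders, not Moonshot's actual orchestration API.

```python
# Illustrative fan-out pattern: N sub-agent calls in parallel against a
# self-hosted, OpenAI-compatible server. All endpoint details are
# placeholders for whatever serving stack you run.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def sub_agent(task: str) -> str:
    resp = await client.chat.completions.create(
        model="kimi-k2.5",  # placeholder model tag
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

async def swarm(tasks: list[str]) -> list[str]:
    # Run every sub-agent concurrently and collect results in order.
    return await asyncio.gather(*(sub_agent(t) for t in tasks))

results = asyncio.run(swarm(["Audit auth.py", "Audit db.py", "Audit api.py"]))
```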
Arcee AI
| Model | Version | Params | Context | License | Best For |
|---|---|---|---|---|---|
| Trinity Large Thinking | trinity-large-thinking | 398B (13B active) | 512K | Apache 2.0 | Enterprise reasoning |
30-person US startup. Released April 3, 2026. 256 experts with 4 active per token (1.56% sparsity ratio). Trained for 33 days on 2,048 NVIDIA Blackwell GPUs on 17T tokens. Generates explicit reasoning traces in <think> blocks. Keeps pace with Claude Opus 4.6 on agent benchmarks (Tau2, PinchBench).
Tip: At $0.90/M output tokens via API, Trinity is ~96% cheaper than comparable proprietary models. One of the few frontier-class US-made open models enterprises can fully own.
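Because Trinity emits its chain of thought inside `<think>` blocks, downstream code usually wants to split the trace from the final answer. A small, framework-agnostic helper (pure string handling, independent of any serving stack):

```python
# Separate <think> reasoning traces from the final answer in a completion.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a <think>-annotated completion."""
    reasoning = "\n".join(re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL))
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning.strip(), answer

r, a = split_reasoning("<think>4 of 256 experts fire: 1.56%.</think>Sparsity is 1.56%.")
print(a)  # "Sparsity is 1.56%."
```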
MiniMax
| Model | Version | Total Params | Active | Context | License | Best For |
|---|---|---|---|---|---|---|
| MiniMax M2.7 | minimax-m2.7 | 230B | 10B (MoE) | 200K | Apache 2.0 | Self-evolving agentic |
| MiniMax-01 | minimax-01 | 456B | 45.9B (MoE) | 4M | Apache 2.0 | Ultra-long context |
M2.7 (open-sourced April 2026) is the first model to deeply participate in its own development — ran 100+ autonomous rounds of scaffold optimization. Ranks #1 on the Artificial Analysis Intelligence Index. Scores 56.2% on SWE-Pro.
Tip: MiniMax-01’s 4-million-token context window is second only to Llama 4 Scout’s 10M among open-weight models. M2.7 at 10B active params is exceptionally cost-efficient.
Microsoft Phi
| Model | Version | Params | Context | License | Best For |
|---|---|---|---|---|---|
| Phi-4-Reasoning-Vision | phi-4-reasoning-vision-15b | 15B | 32K | MIT | Multimodal reasoning |
| Phi-4-Reasoning | phi-4-reasoning | 14B | 32K | MIT | STEM, math, coding |
| Phi-4 | phi-4 | 14B | 16K | MIT | General STEM |
| Phi-4-Mini | phi-4-mini | 3.8B | 128K | MIT | Mobile, multilingual |
Phi-4-Reasoning (March 2026) achieves 75.3% on AIME 2024, approaching full DeepSeek R1 performance at a fraction of the size. The Vision variant adds multimodal reasoning across math, science, and UI understanding. Phi-4-Mini’s 200K-token vocabulary supports broad multilingual coverage.
Tip: Phi-4-Reasoning outperforms DeepSeek-R1-Distill-Llama-70B despite being 5× smaller. Best value for STEM reasoning on consumer hardware.
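A local-load sketch for Phi-4-Reasoning. In bf16 the 14B weights need roughly 28GB, so 4-bit quantization is the practical route on a 24GB card; the repo ID reflects Microsoft's published naming and is worth verifying for the variant you want.

```python
# Load Phi-4-Reasoning 4-bit quantized to fit a 24GB consumer GPU
# (bf16 weights alone would need ~28GB).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "microsoft/Phi-4-reasoning"  # verify the exact variant ID on the hub
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, quantization_config=quant, device_map="auto"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many primes are below 100?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(prompt, max_new_tokens=1024)  # leave room for the reasoning trace
print(tokenizer.decode(out[0], skip_special_tokens=True))
```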
AI21 Labs (Jamba)
| Model | Version | Total Params | Active | Context | License | Best For |
|---|---|---|---|---|---|---|
| Jamba Large 1.7 | jamba-large-1.7 | 398B | 94B | 256K | Jamba Open | Enterprise long-context |
| Jamba Reasoning 3B | jamba-reasoning-3b | 3B | — | 256K (1M ext.) | Jamba Open | Compact reasoning |
Hybrid SSM-Transformer architecture — 2.5× faster than standard transformers on long contexts. Jamba Large 1.7 is the latest, with improvements in grounding and instruction-following. The 3B reasoning variant handles up to 1M tokens with 2-5× efficiency gains over standard transformers.
Tip: Jamba’s hybrid architecture makes it uniquely fast for long-context workloads. Best choice when you need 256K context with low latency.
Cohere
| Model | Version | Params | Languages | License | Best For |
|---|---|---|---|---|---|
| Tiny Aya Global | tiny-aya-global | 3.35B | 70+ | Open Weight | Multilingual instruction |
| Cohere Transcribe | cohere-transcribe | 2B | 14 | Apache 2.0 | Enterprise speech-to-text |
Tiny Aya (February 2026) covers 70+ languages including specialized variants for Africa/West Asia (Aya-Earth) and South Asia (Aya-Fire). Runs on laptops without internet. Transcribe (March 2026) hits 5.42% WER across 14 languages — production-grade ASR under Apache 2.0.
Tip: Tiny Aya is the best option for multilingual deployment on edge devices. Cohere’s flagship Command models remain API-only.
xAI (Grok)
| Model | Version | Params | Context | License | Best For |
|---|---|---|---|---|---|
| Grok 2.5 | grok-2.5 | 268B (MoE) | 128K | Grok Community | General reasoning |
Grok 2.5 weights released August 2025 under a revocable community license. Prohibits using model outputs to train other AI models. Grok 3 was promised for open-weight release by February 2026 but remains proprietary as of April 2026.
Tip: Grok 2.5 requires 8× 40GB GPUs (~500GB weights). The community license is more restrictive than most — review terms before commercial deployment.
Quick Comparison: LLMs
| Model | Provider | Params (Active) | Context | License | Min Hardware |
|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 1T (32B) | 256K | Modified MIT | 4× H100 |
| GLM-5.1 | Z.ai | 754B (40B) | 200K | MIT | 8× A100 80GB |
| Mistral Large 3 | Mistral | 675B (41B) | 128K | Apache 2.0 | 8× A100 80GB |
| DeepSeek V3.2 | DeepSeek | 671B (37B) | 128K | MIT | 8× H100 |
| Qwen3-Coder-480B | Alibaba | 480B (35B) | 262K | Apache 2.0 | 4× H100 80GB |
| MiniMax-01 | MiniMax | 456B (45.9B) | 4M | Apache 2.0 | 8× A100 80GB |
| Llama 4 Maverick | Meta | 400B (17B) | 1M | Community | 4× H100 |
| Jamba Large 1.7 | AI21 | 398B (94B) | 256K | Jamba Open | 4× H100 |
| Trinity Large Thinking | Arcee AI | 398B (13B) | 512K | Apache 2.0 | 4× H100 |
| Qwen 3.5 397B | Alibaba | 397B (17B) | 262K | Apache 2.0 | 4× H100 80GB |
| Grok 2.5 | xAI | 268B MoE | 128K | Grok Community | 8× 40GB GPUs |
| MiniMax M2.7 | MiniMax | 230B (10B) | 200K | Apache 2.0 | 2× A100 80GB |
| Devstral 2 | Mistral | 123B (dense) | 256K | Modified MIT | 4× H100 |
| Qwen 3.5 122B | Alibaba | 122B (10B) | 262K | Apache 2.0 | 2× A100 80GB |
| Nemotron 3 Super | NVIDIA | 120B (12B) | 1M | NVIDIA Open | 2× H100 |
| Mistral Small 4 | Mistral | 119B (6B) | 256K | Apache 2.0 | 4× H100 |
| Llama 4 Scout | Meta | 109B (17B) | 10M | Community | 1× H100 (Q4) |
| Gemma 4 31B | Google | 31B (dense) | 256K | Apache 2.0 | 1× A100 40GB |
| Nemotron 3 Nano | NVIDIA | 31.6B (3.2B) | 1M | NVIDIA Open | RTX 4090 |
| Gemma 4 26B MoE | Google | 26B (4B) | 256K | Apache 2.0 | RTX 4090 |
| Devstral Small 2 | Mistral | 24B (dense) | 256K | Apache 2.0 | RTX 4090 |
| Phi-4-Reasoning-Vision | Microsoft | 15B | 32K | MIT | RTX 3090 |
| Phi-4-Reasoning | Microsoft | 14B | 32K | MIT | RTX 3090 |
Image Generation (Open Weights)
| Model | Provider | Params | License | Min VRAM | Best For |
|---|---|---|---|---|---|
| Stable Diffusion 3.5 | Stability AI | 8B | Stability Community | RTX 4090 24GB | Self-hosting, LoRA, ControlNet |
| Flux.1 Dev | Black Forest Labs | 12B | NC License | RTX 4090 24GB | Photorealism, research |
| Flux.1 Schnell | Black Forest Labs | 12B | Apache 2.0 | RTX 3090 24GB | Commercial use, speed |
Tip: Flux.1 Schnell uses Apache 2.0 license — fully commercial-ready and 10× faster than Dev.
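A minimal diffusers sketch for Schnell, which is distilled for few-step sampling (hence the zero guidance scale):

```python
# Four-step image generation with Flux.1 Schnell via diffusers.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trims peak VRAM on 24GB cards

image = pipe(
    "a lighthouse at dusk, photorealistic",
    num_inference_steps=4,   # Schnell is distilled for ~4 steps
    guidance_scale=0.0,      # distilled models skip classifier-free guidance
).images[0]
image.save("lighthouse.png")
```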
Audio & Speech (Open Source)
| Model | Provider | Type | Languages | License | Min VRAM |
|---|---|---|---|---|---|
| Cohere Transcribe | Cohere | Speech-to-Text | 14 | Apache 2.0 | RTX 3060 |
| Whisper Large v3 | OpenAI | Speech-to-Text | 99 | MIT | RTX 3060 |
| Voxtral TTS | Mistral | Text-to-Speech | Multi | Apache 2.0 | RTX 3060 |
| Bark | Suno | Text-to-Speech | 13+ | MIT | 12GB |
| Tortoise TTS | Community | Text-to-Speech | English | Apache 2.0 | RTX 3080 |
| MusicGen Large | Meta | Text-to-Music | N/A | CC-BY-NC | RTX 3090 24GB |
Tip: Cohere Transcribe (March 2026) hits 5.42% WER — enterprise-grade ASR that’s commercial-ready from day one. Voxtral TTS (4B params) runs on consumer laptops.
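Whisper remains the simplest self-hosted ASR to stand up. A minimal sketch with the `openai-whisper` package (requires ffmpeg on PATH):

```python
# Local speech-to-text with Whisper Large v3.
import whisper

model = whisper.load_model("large-v3")    # ~10GB VRAM in fp16
result = model.transcribe("meeting.mp3")  # language auto-detected
print(result["text"])

# Timestamped segments for subtitles or diarization pipelines.
for seg in result["segments"]:
    print(f'[{seg["start"]:7.2f}s] {seg["text"]}')
```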
Hardware Tiers for Self-Hosting
Consumer (8-24GB VRAM)
RTX 3080 / 3090 / 4090 (see the GGUF sketch after this list)
- Nemotron 3 Nano 31.6B / 3.2B active (MoE, 1M ctx) ⭐ Best local all-rounder
- GLM-4.7-Flash 30B / 3B active (MoE) ⭐ Local coding + tool use
- Qwen 3.5 35B-A3B 35B / 3B active (MoE) ⭐ Local deployment
- Gemma 4 26B-A4B 26B / 4B active (MoE, 256K ctx) ⭐ Reasoning on consumer GPU
- Gemma 4 31B (Q4)
- Devstral Small 2 24B (dense, 256K ctx)
- Phi-4-Reasoning 14B / Phi-4-Reasoning-Vision 15B (Q8; FP16 exceeds 24GB)
- Gemma 4 E4B / E2B (edge / mobile)
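Most models in this tier are run as 4-bit GGUF quants. A minimal sketch with `llama-cpp-python`; the file name is a placeholder for whichever community quant you download.

```python
# Run a 4-bit GGUF quant fully on the GPU with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="devstral-small-2-24b.Q4_K_M.gguf",  # placeholder file name
    n_ctx=16384,        # context length; raise it if VRAM allows
    n_gpu_layers=-1,    # offload every layer to the GPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain Q4_K_M quantization in one paragraph."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```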
Prosumer (40-80GB VRAM)
1-2× A100 / H100
- Qwen 3.5 122B-A10B (MoE)
- Qwen 3.5 27B (dense, FP16)
- Gemma 4 31B (FP16)
- Llama 4 Scout (Q4)
- MiniMax M2.7 230B / 10B active (Q4)
- Nemotron 3 Super 120B / 12B active
- DeepSeek R1 distilled 70B
Enterprise (320GB+ VRAM)
4-8× H100 Cluster (sizing math after this list)
- Kimi K2.5 1T / 32B active (FP8)
- GLM-5.1 754B / 40B active (FP8)
- Mistral Large 3 675B / 41B active (FP8)
- DeepSeek V3.2 671B / 37B active (FP8)
- Qwen3-Coder-480B / 35B active (FP8)
- MiniMax-01 456B / 45.9B active (FP8)
- Llama 4 Maverick 400B / 17B active (FP16)
- Arcee Trinity Large Thinking 398B / 13B active
- Jamba Large 1.7 398B / 94B active
- Qwen 3.5 397B / 17B active (FP8)
- Devstral 2 123B dense
- Mistral Small 4 119B / 6B active
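These tiers follow directly from the weights: every MoE parameter must be resident in VRAM even though only the active subset fires per token. A quick back-of-envelope sizing sketch (weights only; KV cache and activations add more on top):

```python
# Weight-memory floor for the enterprise tier: 1B params at 8-bit = 1GB.
def weight_vram_gb(total_params_b: float, bits_per_param: int) -> float:
    return total_params_b * bits_per_param / 8

for name, params_b in [("DeepSeek V3.2", 671), ("GLM-5.1", 754), ("Kimi K2.5", 1000)]:
    gb = weight_vram_gb(params_b, 8)  # FP8 weights
    print(f"{name}: ~{gb:.0f}GB weights -> {gb / 80:.1f}x 80GB GPUs before KV cache")
# DeepSeek V3.2: ~671GB -> 8.4x 80GB GPUs, which is why 8x H100 is the floor.
```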
Inference Frameworks
| Framework | Best For | Key Feature |
|---|---|---|
| vLLM | Production APIs | PagedAttention, high throughput |
| SGLang | Structured output | Fast JSON/code generation |
| Ollama | Local inference | One-command setup |
| llama.cpp | CPU/low VRAM | Quantization, runs on laptops |
| TGI | HuggingFace | Docker-ready production |
| ExLlamaV2 | Consumer GPUs | Fastest quantized inference |
License Quick Reference
| License | Models | Commercial Use |
|---|---|---|
| MIT | DeepSeek, Z.ai (GLM), Phi, Bark, Whisper | ✅ Fully permissive |
| Apache 2.0 | Qwen, Gemma 4, Mistral (most), Arcee, MiniMax, NVIDIA, Cohere Transcribe, Flux Schnell | ✅ Fully permissive |
| NVIDIA Open | Nemotron 3 | ✅ Permissive (open weights + data) |
| Llama Community | Llama 4 | ✅ With restrictions |
| Jamba Open | Jamba 1.7 | ✅ With restrictions |
| Modified MIT | Kimi K2.5, Devstral 2 | ✅ With revenue restrictions |
| Grok Community | Grok 2.5 | ⚠️ Revocable, no model training |
| CC-BY-NC | MusicGen | ❌ Non-commercial only |
| Custom NC | Flux.1 Dev | ❌ Non-commercial only |
Key Takeaways
- Best overall open model: GLM-5.1, DeepSeek V3.2, or Kimi K2.5 — compete with closed-source flagships
- Best for reasoning: DeepSeek R1 (MIT), Arcee Trinity Large Thinking (Apache 2.0), or Phi-4-Reasoning (MIT, consumer GPU)
- Longest context: Llama 4 Scout (10M), MiniMax-01 (4M), Nemotron 3 (1M), or Llama 4 Maverick (1M)
- Best for coding agents: Qwen3-Coder-480B, Devstral 2 (123B dense), GLM-5.1, or MiniMax M2.7
- Best for consumer GPUs: Nemotron 3 Nano (1M ctx!), Gemma 4 26B MoE, GLM-4.7-Flash, Qwen 3.5 35B MoE, Devstral Small 2
- Best Apache 2.0 license: Qwen 3.5, Gemma 4, Mistral Small 4, MiniMax, Arcee — fully permissive for commercial use
- Best image generation: Flux.1 Schnell (Apache 2.0) or SD 3.5 for commercial use
- New in 2026: Gemma 4 (Apache 2.0!), Nemotron 3, Kimi K2.5, MiniMax M2.7, Mistral Small 4, Devstral 2
This cheatsheet is updated as new models drop. For the companion reference covering closed-source API models (Claude, GPT, Gemini, Grok), see the Frontier AI Models Cheatsheet.