LLM Architecture Gallery

A collection of architecture fact sheets for leading Large Language Models, detailing their scale, decoder type, attention mechanisms, and key design choices.
This page collects architecture figures and fact sheets from The Big LLM Architecture Comparison and A Dream of Spring for Open-Weight LLMs. It focuses on the architecture panels only.
If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue in the Architecture Gallery issue tracker.
Llama 3 8B
Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.
- Scale: 8B parameters
- Date: 2024-04-18
- Decoder type: Dense
- Attention: GQA with RoPE
- Key detail: Pre-norm baseline; wider than OLMo 2 at a similar scale.
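To make the GQA entry concrete, here is a rough sketch of how grouped-query attention shrinks the KV cache compared with classic MHA. The layer count, head counts, head dimension, and 2-byte (bf16) storage are assumptions based on the publicly released Llama 3 8B configuration; none of them appear in this fact sheet.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3 8B-style shape (not stated in the fact sheet).
N_LAYERS, N_QUERY_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

gqa = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
# MHA keeps one KV head per query head.
mha = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)

print(f"GQA: {gqa // 1024} KiB per token")
print(f"MHA: {mha // 1024} KiB per token")
print(f"MHA/GQA ratio: {mha // gqa}x")
```

Under these assumptions the saving is simply the ratio of query heads to KV heads, which is why GQA matters most at long context lengths.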
OLMo 2 7B
Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.
- Scale: 7B parameters
- Date: 2024-11-25
- Decoder type: Dense
- Attention: MHA with QK-Norm
- Key detail: Uses inside-residual post-norm instead of the usual pre-norm layout.
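The difference between the two normalization layouts is easiest to see side by side. The following is a minimal sketch, not OLMo 2's actual code: `rms_norm` omits the learned scale, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward module.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for illustration only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward module."""
    return [3.0 * v + 1.0 for v in x]


def pre_norm_block(x, sublayer=toy_sublayer):
    # Usual layout: normalize the input, then add the sublayer output.
    branch = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, branch)]


def inside_residual_post_norm_block(x, sublayer=toy_sublayer):
    # Post-norm kept inside the residual branch: normalize the sublayer output.
    branch = rms_norm(sublayer(x))
    return [a + b for a, b in zip(x, branch)]


x = [0.5, -1.0, 2.0, 4.0]
print(pre_norm_block(x))
print(inside_residual_post_norm_block(x))
```

In the second layout the value added back to the residual stream always has roughly unit RMS, whatever the sublayer produces, which is one intuition for why it can help training stability.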
DeepSeek V3
DeepSeek's flagship template kicked off the recent wave of large open MoE models.
- Scale: 671B total, 37B active
- Date: 2024-12-26
- Decoder type: Sparse MoE
- Attention: MLA
- Key detail: Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
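The shared-expert idea can be sketched with a toy MoE layer: a router picks a few routed experts per token, while the shared expert runs for every token. This is an illustration under simplifying assumptions, not DeepSeek's router. The softmax gating, the scalar hidden state, and the toy experts are all hypothetical, and the dense prefix (ordinary feed-forward layers in the first few blocks) is not modeled.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, routed_experts, shared_expert, top_k):
    """Toy MoE layer for one token with a scalar hidden state.

    Only `top_k` routed experts run; the shared expert always runs.
    """
    probs = softmax(router_logits)
    # Indices of the top_k experts with the highest router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize gate weights over the chosen experts (one common choice).
    gate_sum = sum(probs[i] for i in chosen)
    routed_out = sum(probs[i] / gate_sum * routed_experts[i](x) for i in chosen)
    return shared_expert(x) + routed_out, chosen


# Hypothetical toy experts: each one just scales its input.
routed_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda x: 0.5 * x

y, chosen = moe_forward(2.0, [0.1, 2.0, -1.0, 2.0], routed_experts, shared_expert, top_k=2)
print(chosen, y)
```

Because only `top_k` routed experts execute per token, compute tracks the active parameter count rather than the total, which is how a model of this size stays practical at inference.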
DeepSeek R1
Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.
- Scale: 671B total, 37B active
- Date: 2025-01-20
- Decoder type: Sparse MoE
- Attention: MLA
- Key detail: Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.
Gemma 3 27B
Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.
- Scale: 27B parameters
- Date: 2025-03-11
- Decoder type: Dense
- Attention: GQA with QK-Norm and 5:1 sliding-window/global attention
- Key detail: Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
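A 5:1 sliding-window/global ratio can be pictured as a per-layer schedule plus a windowed causal mask for the local layers. The sketch below assumes each group is five local layers followed by one global layer, and uses toy values for the layer count and window size; the real ordering and sizes are not given in this fact sheet.

```python
def attention_schedule(n_layers, local_per_global):
    """Return a per-layer list of 'local' or 'global' attention types.

    Assumes each group is `local_per_global` sliding-window layers followed
    by one global layer; the real ordering in a given model may differ.
    """
    group = local_per_global + 1
    return ["global" if (i + 1) % group == 0 else "local" for i in range(n_layers)]


def sliding_window_mask(seq_len, window):
    """Causal mask where query position q sees keys in (q - window, q]."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


schedule = attention_schedule(n_layers=12, local_per_global=5)
print(schedule)

# 'x' marks key positions a query may attend to in a local layer.
for row in sliding_window_mask(seq_len=5, window=3):
    print("".join("x" if allowed else "." for allowed in row))
```

Local layers only need to cache the most recent `window` keys and values, so raising the share of local layers reduces KV-cache memory at long context lengths.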
Mistral Small 3.1 24B
Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.
- Scale: 24B parameters
- Date: 2025-03-18
- Decoder type: Dense
- Attention: Standard GQA
- Key detail: Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.
Llama 4 Maverick
Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.
- Scale: 400B total, 17B active
- Date: 2025-04-05
- Decoder type: Sparse MoE
- Attention: GQA
- Key detail: Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.
Qwen3 235B-A22B
Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.
- Scale: 235B total, 22B active
- Date: 2025-04-28
- Decoder type: Sparse MoE
- Attention: GQA with QK-Norm
- Key detail: High-capacity MoE design optimized for serving efficiency without a shared expert.
Qwen3 32B
Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.
- Scale: 32B parameters
- Date: 2025-04-28
- Decoder type: Dense
- Attention: GQA with QK-Norm
- Key detail: Reference dense Qwen stack with QK-Norm and 8 KV heads.
Qwen3 4B
Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.
- Scale: 4B parameters
- Date: 2025-04-28
- Decoder type: Dense
- Attention: GQA with QK-Norm
- Key detail: Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.
Qwen3 8B
Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.
- Scale: 8B parameters
- Date: 2025-04-28
- Decoder type: Dense
- Attention: GQA with QK-Norm
- Key detail: Reference Qwen3 dense stack with QK-Norm and 8 KV heads.
SmolLM3 3B
Compact dense model that experiments with leaving out positional encodings in selected layers.
- Scale: 3B parameters
- Date: 2025-06-19
- Decoder type: Dense
- Attention: GQA with periodic NoPE layers
- Key detail: Every fourth layer omits RoPE to test a NoPE-style cadence.
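The "every fourth layer omits RoPE" rule can be sketched as a per-layer switch around a rotary embedding. This is a simplified illustration: the 0-based layer indexing, the choice of which layer in each group of four skips RoPE, the base of 10000, and the consecutive-pair rotation convention are all assumptions rather than details from this fact sheet.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    out = []
    half = len(vec) // 2
    for i in range(half):
        theta = position * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def uses_rope(layer_idx, nope_interval=4):
    # Assumption: layers are counted from 0 and layers 3, 7, 11, ... skip RoPE.
    return (layer_idx + 1) % nope_interval != 0


def position_encode(vec, position, layer_idx):
    """Apply RoPE in most layers; leave the vector untouched in NoPE layers."""
    return apply_rope(vec, position) if uses_rope(layer_idx) else list(vec)


q = [1.0, 0.0, 0.0, 1.0]
for layer in range(4):
    print(layer, uses_rope(layer), position_encode(q, position=5, layer_idx=layer))
```

In the NoPE layers the query and key vectors carry no explicit position signal, so any ordering information there has to come from the causal mask and from the RoPE layers around them.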
Kimi K2
Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.
- Scale: 1T total, 32B active
- Date: 2025-07-10
- Decoder type: Sparse MoE
- Attention: MLA
- Key detail: More experts and fewer MLA heads than DeepSeek V3.
GLM-4.5 355B
Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.
- Scale: 355B total, 32B active
- Date: 2025-07-28
- Decoder type: Sparse MoE
- Attention: GQA with QK-Norm
- Key detail: Starts with three dense layers before MoE routing and keeps a shared expert.
GPT-OSS 120B
Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.
- Scale: 120B parameters
- Date: 2025-08-04
- Decoder type: Sparse MoE
- Attention: GQA with alternating sliding-window and global layers
- Key detail: Shared architectural template scaled up for OpenAI's flagship open-weight release.
GPT-OSS 20B
OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.
- Scale: 20B total, 3.6B active
- Date: 2025-08-04
- Decoder type: Sparse MoE
- Attention: GQA with alternating sliding-window and global layers
- Key detail: Wider and shallower than Qwen3, with attention bias and sink mechanisms.
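One way to picture an attention sink is as an extra logit that takes part in the softmax normalization but contributes no value vector. The sketch below assumes that formulation (a single per-head sink logit in the denominator); this page does not describe the mechanism, so treat the function and its `sink_logit` parameter as hypothetical.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over `scores` with one extra sink logit in the denominator.

    The sink's share of probability is discarded, so the returned weights
    over real tokens sum to less than 1 whenever the sink is active.
    """
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


scores = [2.0, 1.0, 0.5]
plain = softmax_with_sink(scores, sink_logit=float("-inf"))  # sink disabled
sunk = softmax_with_sink(scores, sink_logit=2.0)
print(sum(plain), sum(sunk))
```

The relative weights among real tokens are unchanged; the sink only gives a head a way to attend to "nothing" instead of being forced to spread a full unit of probability over the context.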
Grok 2.5 270B
Rare production-model release that shows an older MoE style with fewer, larger experts.
- Scale: 270B parameters
- Date: 2025-08-22
- Decoder type: Sparse MoE
- Attention: GQA
- Key detail: Adds an always-on SwiGLU path that effectively behaves like a shared expert.
Qwen3 Next 80B-A3B
Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.
- Scale: 80B total, 3B active
- Date: 2025-09-09
- Decoder type: Sparse hybrid
- Attention: 3:1 Gated DeltaNet and Gated Attention
- Key detail: Adds many more experts, a shared expert, and a native 262k context.
MiniMax M2 230B
MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.
- Scale: 230B total, 10B active
- Date: 2025-10-23
- Decoder type: Sparse MoE
- Attention: GQA with QK-Norm and partial RoPE
- Key detail: Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.
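A quick way to compare sparsity across the MoE entries so far is the share of parameters active per token. The figures below are copied from the Scale fields on this page (with 1T written as 1000B). Note that this ratio is only a rough proxy for routing sparsity, since attention weights and any shared expert count as active for every token.

```python
# (total, active) parameter counts in billions, taken from the fact sheets above.
models = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (400, 17),
    "Qwen3 235B-A22B": (235, 22),
    "Kimi K2": (1000, 32),
    "GLM-4.5 355B": (355, 32),
    "Qwen3 Next 80B-A3B": (80, 3),
    "MiniMax M2 230B": (230, 10),
}


def active_fraction(total_b, active_b):
    """Share of parameters used per token."""
    return active_b / total_b


# Print from sparsest to densest.
for name, (total_b, active_b) in sorted(
    models.items(), key=lambda kv: active_fraction(*kv[1])
):
    print(f"{name:20s} {active_fraction(total_b, active_b):6.1%}")
```

By this measure MiniMax M2 activates roughly half the share of its parameters that Qwen3 235B-A22B does, which is consistent with the "much sparser" description.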
Kimi Linear 48B-A3B
Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.
- Scale: 48B total, 3B active
- Date: 2025-10-30
- Decoder type: Sparse hybrid
- Attention: 3:1 Kimi Delta Attention and MLA
- Key detail: Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.
OLMo 3 32B
Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.
- Scale: 32B parameters
- Date: 2025-11-20
- Decoder type: Dense
- Attention: GQA with QK-Norm and 3:1 sliding-window/global attention
- Key detail: Keeps post-norm while scaling width and applying YaRN only on global layers.
OLMo 3 7B
New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.
- Scale: 7B parameters
- Date: 2025-11-20
- Decoder type: Dense
- Attention: MHA with QK-Norm and 3:1 sliding-window/global attention
- Key detail: Retains post-norm, keeps MHA, and applies YaRN only on global layers.
DeepSeek V3.2
DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.
- Scale: 671B total, 37B active
- Date: 2025-12-01
- Decoder type: Sparse MoE
- Attention: MLA with DeepSeek Sparse Attention
- Key detail: An evolutio