LLM Architecture Gallery

A collection of architecture fact sheets for leading large language models, detailing their scale, decoder type, attention mechanism, and key design choices.

This page collects architecture figures and fact sheets from "The Big LLM Architecture Comparison" and "A Dream of Spring for Open-Weight LLMs". It covers the architecture panels only. Click a figure to enlarge it, and use the model title to jump to the corresponding article section.

If you spot an inaccurate fact sheet, a mislabeled architecture, or a broken link, please file an issue on the Architecture Gallery issue tracker.

Llama 3 8B

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

Scale: 8B parameters
Date: 2024-04-18
Decoder type: Dense
Attention: GQA with RoPE
Key detail: Pre-norm baseline; wider than OLMo 2 at a similar scale.
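
To make the GQA entry above more concrete, the following minimal sketch maps each query head to the key/value head it shares. The function name and the head counts (32 query heads, 8 KV heads) are assumptions chosen for illustration; they are not taken from this fact sheet.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per shared KV head
    return q_head // group_size


# Illustrative head counts (assumed for this sketch).
N_Q_HEADS, N_KV_HEADS = 32, 8
mapping = [kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Setting `n_kv_heads` equal to `n_q_heads` gives every query head its own KV head, which is plain MHA; setting it to 1 gives multi-query attention.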

OLMo 2 7B

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

Scale: 7B parameters
Date: 2024-11-25
Decoder type: Dense
Attention: MHA with QK-Norm
Key detail: Uses inside-residual post-norm instead of the usual pre-norm layout.
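
The difference between the two layouts is easiest to see side by side. The sketch below is a simplified illustration, not OLMo 2's actual code: it uses an RMSNorm without a learnable scale, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward module.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learnable scale (omitted to keep the sketch short)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def pre_norm_block(x, sublayer):
    """Usual pre-norm layout: normalize the input, then add the residual."""
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_norm_inside_residual_block(x, sublayer):
    """Inside-residual post-norm: normalize the sublayer output, then add the residual."""
    y = rms_norm(sublayer(x))
    return [a + b for a, b in zip(x, y)]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP: scales its input by 10."""
    return [10.0 * v for v in x]


x = [3.0, 4.0]
print(pre_norm_block(x, toy_sublayer))
print(post_norm_inside_residual_block(x, toy_sublayer))
```

In the second layout the branch that is added back to the residual stream always has an RMS close to 1, regardless of how large the sublayer output is. In the pre-norm layout the branch magnitude is left unconstrained.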

DeepSeek V3

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

Scale: 671B total, 37B active
Date: 2024-12-26
Decoder type: Sparse MoE
Attention: MLA
Key detail: Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
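
The shared-expert idea can be sketched in a few lines: a router activates only the top-k routed experts for a token, while the shared expert always runs. This is a toy illustration under stated assumptions, not DeepSeek's implementation; the function name, the scalar "experts", and the softmax over the selected scores are all choices made for this sketch.

```python
import math


def moe_layer(x, routed_experts, shared_expert, router_scores, top_k):
    """Combine an always-on shared expert with the top_k highest-scoring routed experts."""
    # Indices of the top_k routed experts by router score.
    order = sorted(range(len(routed_experts)), key=lambda i: router_scores[i], reverse=True)
    top = order[:top_k]

    # Softmax over the selected scores only (an assumption of this sketch).
    max_score = max(router_scores[i] for i in top)
    exps = [math.exp(router_scores[i] - max_score) for i in top]
    weights = [e / sum(exps) for e in exps]

    # The shared expert processes every token, independent of the router.
    out = list(shared_expert(x))
    for w, i in zip(weights, top):
        y = routed_experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out, top


# Four hypothetical routed experts that multiply their input by 0, 1, 2, 3.
experts = [lambda v, k=k: [k * t for t in v] for k in range(4)]
shared = lambda v: list(v)

out, chosen = moe_layer([1.0], experts, shared, router_scores=[0.1, 2.0, 0.3, 2.0], top_k=2)
print(chosen, out)
```

Only `top_k` of the routed experts run per token, which is why the active parameter count in the fact sheet is far smaller than the total.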

DeepSeek R1

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

Scale: 671B total, 37B active
Date: 2025-01-20
Decoder type: Sparse MoE
Attention: MLA
Key detail: Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.

Gemma 3 27B

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

Scale: 27B parameters
Date: 2025-03-11
Decoder type: Dense
Attention: GQA with QK-Norm and 5:1 sliding-window/global attention
Key detail: Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
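
The 5:1 ratio can be illustrated with a small layer schedule and a sliding-window causal mask. This is a sketch under assumptions: the global layer is placed last in each group of six, the layer count and window size are arbitrary, and the window is taken to include the current token.

```python
def layer_types(n_layers: int, local_per_global: int = 5) -> list:
    """Label each layer 'local' (sliding-window) or 'global' (full attention)."""
    period = local_per_global + 1
    # Assumption: the global layer closes each group of `period` layers.
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


def causal_mask(seq_len: int, window=None) -> list:
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_past = j <= i
            in_window = window is None or (i - j) < window
            row.append(in_past and in_window)
        mask.append(row)
    return mask


schedule = layer_types(12)  # 12 layers is an arbitrary illustrative depth
print(schedule)

local_mask = causal_mask(4, window=2)   # sliding-window layer
global_mask = causal_mask(4)            # global layer
print(local_mask[3])   # [False, False, True, True]
print(global_mask[3])  # [True, True, True, True]
```

Local layers only need to keep the last `window` keys and values in the cache, so a higher local-to-global ratio reduces KV-cache memory at long context lengths.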

Mistral Small 3.1 24B

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

Scale: 24B parameters
Date: 2025-03-18
Decoder type: Dense
Attention: Standard GQA
Key detail: Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.
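
To see why layer count and KV-head count drive cache size, the sketch below computes the KV-cache footprint of a full-attention decoder. Both configurations are hypothetical and chosen only to show the scaling; they are not the actual Mistral Small 3.1 or Gemma 3 settings.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (full attention in every layer)."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configurations, 16-bit values, 1024 cached tokens.
shallow = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, seq_len=1024)
deep = kv_cache_bytes(n_layers=60, n_kv_heads=16, head_dim=128, seq_len=1024)

print(shallow / 2**20, "MiB")
print(deep / 2**20, "MiB")
```

The cache grows linearly in each factor, so removing layers or sharing KV heads across more query heads shrinks it proportionally.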

Llama 4 Maverick

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

Scale: 400B total, 17B active
Date: 2025-04-05
Decoder type: Sparse MoE
Attention: GQA
Key detail: Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.
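
The two block layouts mentioned in this gallery, alternating dense/MoE blocks here and a dense prefix followed by MoE blocks in the DeepSeek V3 entry, can be contrasted with a small schedule sketch. The function names, the layer count, and the choice of which alternating position holds the MoE block are assumptions made for illustration.

```python
def alternating_schedule(n_layers: int, moe_every: int = 2) -> list:
    """Every `moe_every`-th block is MoE; the rest use a dense feed-forward layer."""
    return ["moe" if (i + 1) % moe_every == 0 else "dense" for i in range(n_layers)]


def dense_prefix_schedule(n_layers: int, n_dense_prefix: int) -> list:
    """The first `n_dense_prefix` blocks are dense; all later blocks are MoE."""
    return ["dense" if i < n_dense_prefix else "moe" for i in range(n_layers)]


# Six layers is an arbitrary depth used only for display.
print(alternating_schedule(6))
print(dense_prefix_schedule(6, n_dense_prefix=3))
```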

Qwen3 235B-A22B

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

Scale: 235B total, 22B active
Date: 2025-04-28
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: High-capacity MoE design optimized for serving efficiency without a shared expert.

Qwen3 32B

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

Scale: 32B parameters
Date: 2025-04-28
Decoder type: Dense
Attention: GQA with QK-Norm
Key detail: Reference dense Qwen stack with QK-Norm and 8 KV heads.

Qwen3 4B

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

Scale: 4B parameters
Date: 2025-04-28
Decoder type: Dense
Attention: GQA with QK-Norm
Key detail: Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.

Qwen3 8B

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

Scale: 8B parameters
Date: 2025-04-28
Decoder type: Dense
Attention: GQA with QK-Norm
Key detail: Reference Qwen3 dense stack with QK-Norm and 8 KV heads.

SmolLM3 3B

Compact dense model that experiments with leaving out positional encodings in selected layers.

Scale: 3B parameters
Date: 2025-06-19
Decoder type: Dense
Attention: GQA with periodic NoPE layers
Key detail: Every fourth layer omits RoPE to test a NoPE-style cadence.
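
The cadence described above can be sketched as a per-layer switch around a rotary rotation. This is an illustration under assumptions: the NoPE layer is taken to be the last in each group of four, and RoPE is reduced to a single two-dimensional pair with a hypothetical frequency.

```python
import math


def uses_rope(layer_idx: int, nope_every: int = 4) -> bool:
    """True for layers that apply RoPE; every `nope_every`-th layer skips it."""
    return (layer_idx + 1) % nope_every != 0


def rope_rotate_pair(x0: float, x1: float, position: int, freq: float):
    """Rotate one (x0, x1) feature pair by an angle proportional to the token position."""
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)


def position_encode(pair, position: int, layer_idx: int, freq: float = 1.0):
    """Apply RoPE in RoPE layers and pass the pair through unchanged in NoPE layers."""
    if uses_rope(layer_idx):
        return rope_rotate_pair(pair[0], pair[1], position, freq)
    return pair


print([uses_rope(i) for i in range(8)])
print(position_encode((1.0, 0.0), position=3, layer_idx=0))  # rotated
print(position_encode((1.0, 0.0), position=3, layer_idx=3))  # unchanged (NoPE layer)
```

In the NoPE layers, order information comes only from the causal mask and from what earlier RoPE layers have already written into the hidden states.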

Kimi K2

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

Scale: 1T total, 32B active
Date: 2025-07-10
Decoder type: Sparse MoE
Attention: MLA
Key detail: More experts and fewer MLA heads than DeepSeek V3.

GLM-4.5 355B

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

Scale: 355B total, 32B active
Date: 2025-07-28
Decoder type: Sparse MoE
Attention: GQA with QK-Norm
Key detail: Starts with three dense layers before MoE routing and keeps a shared expert.
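
The total and active figures in the sparse MoE fact sheets up to this point can be compared directly as an active fraction. The sketch below uses only the numbers listed in those fact sheets, in billions of parameters, with Kimi K2's 1T written as 1000; the dictionary and function names are illustrative.

```python
# (total, active) parameter counts in billions, copied from the fact sheets above.
SPARSE_MODELS = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (400, 17),
    "Qwen3 235B-A22B": (235, 22),
    "Kimi K2": (1000, 32),  # 1T total expressed in billions
    "GLM-4.5 355B": (355, 32),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the parameters that participate in processing one token."""
    return active_b / total_b


for name, (total_b, active_b) in SPARSE_MODELS.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} active per token")
```

DeepSeek R1 is omitted because its fact sheet lists the same figures as DeepSeek V3.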

GPT-OSS 120B

Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

Scale: 117B total, 5.1B active
Date: 2025-08-04
Decoder type: Sparse MoE
Attention: GQA with alternating sliding-window and global layers
Key detail: Shared architectural template scaled up for OpenAI's flagship open-weight release.
