NOW LET US – AI RAG SaaS Studio TP.HCM

LLM Architecture Gallery

A collection of architecture fact sheets for leading Large Language Models, detailing their scale, decoder type, attention mechanisms, and key design choices.

This page collects architecture figures and fact sheets from "The Big LLM Architecture Comparison" and "A Dream of Spring for Open-Weight LLMs". It focuses on the architecture panels only. Click a figure to enlarge it and use the model title to jump to the corresponding article section.

If you spot an inaccurate fact sheet, mislabeled architecture, or broken link, please file an issue here: Architecture Gallery issue tracker.

Llama 3 8B

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

  • Scale
  • 8B parameters
  • Date
  • 2024-04-18
  • Decoder type
  • Dense
  • Attention
  • GQA with RoPE
  • Key detail
  • Pre-norm baseline; wider than OLMo 2 at a similar scale.
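
As a rough illustration of what grouped-query attention (GQA) changes, the sketch below maps query heads onto shared key/value heads and estimates the per-token KV-cache size. The layer count, head counts, head dimension, and 2-byte element size are illustrative assumptions, not values taken from the fact sheet.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head shared by a given query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored for one token across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration (assumed for illustration only).
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

gqa_bytes = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
mha_bytes = kv_cache_bytes_per_token(N_LAYERS, N_Q_HEADS, HEAD_DIM)  # MHA: one KV head per query head
print(gqa_bytes, mha_bytes)
```

Under these assumed numbers, sharing each KV head across four query heads shrinks the cache by the same factor of four.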

OLMo 2 7B

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

  • Scale
  • 7B parameters
  • Date
  • 2024-11-25
  • Decoder type
  • Dense
  • Attention
  • MHA with QK-Norm
  • Key detail
  • Uses inside-residual post-norm instead of the usual pre-norm layout.
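
The difference between the usual pre-norm layout and a post-norm placed inside the residual branch can be sketched in a few lines. This is a toy, pure-Python version: `rms_norm` has no learnable gain, and `loud_sublayer` is a hypothetical stand-in for an attention or MLP block, so it should be read as an illustration of the ordering rather than OLMo 2's implementation.

```python
import math


def rms_norm(v, eps=1e-6):
    """RMSNorm without a learnable gain (simplification)."""
    rms = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / rms for t in v]


def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the input, then add the raw sublayer output.
    update = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, update)]


def post_norm_inside_residual_block(x, sublayer):
    # Inside-residual post-norm: normalize the sublayer output before the add.
    update = rms_norm(sublayer(x))
    return [a + b for a, b in zip(x, update)]


def loud_sublayer(v):
    """Hypothetical sublayer that produces large activations."""
    return [10.0 * t for t in v]


x = [3.0, 4.0]
pre = pre_norm_block(x, loud_sublayer)
post = post_norm_inside_residual_block(x, loud_sublayer)
print(pre, post)
```

In the second variant the update added to the residual stream has roughly unit RMS no matter how large the sublayer output is, which is one intuition for why this placement may help training stability.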

DeepSeek V3

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

  • Scale
  • 671B total, 37B active
  • Date
  • 2024-12-26
  • Decoder type
  • Sparse MoE
  • Attention
  • MLA
  • Key detail
  • Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
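
To make the "shared expert" idea concrete, here is a minimal sketch of a sparse MoE layer in which one expert always runs and only the top-k routed experts are added on top. It uses generic softmax top-k gating on scalar inputs; the gating function, expert shapes, and all names are assumptions for illustration and do not reproduce DeepSeek V3's actual router.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, shared_expert, top_k):
    """Combine an always-on shared expert with the top-k routed experts."""
    probs = softmax(router_logits)
    # Indices of the k experts with the highest routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = shared_expert(x)             # shared expert sees every token
    for i in top:
        out += (probs[i] / norm) * experts[i](x)
    return out, top


# Toy experts: each just scales its input (hypothetical).
experts = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda v: 0.5 * v

out, chosen = moe_forward(1.0, [0.0, 0.0, 5.0, 5.0], experts, shared, top_k=2)
print(out, chosen)
```

Only two of the four routed experts execute for this token, which is the mechanism that keeps the active parameter count far below the total.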

DeepSeek R1

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

  • Scale
  • 671B total, 37B active
  • Date
  • 2025-01-20
  • Decoder type
  • Sparse MoE
  • Attention
  • MLA
  • Key detail
  • Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.

Gemma 3 27B

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

  • Scale
  • 27B parameters
  • Date
  • 2025-03-11
  • Decoder type
  • Dense
  • Attention
  • GQA with QK-Norm and 5:1 sliding-window/global attention
  • Key detail
  • Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
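
The difference between a sliding-window (local) layer and a global layer comes down to the attention mask. The sketch below builds both masks for a tiny sequence and counts how many query/key pairs each one allows; the sequence length and window size are toy values chosen for readability, not Gemma 3's settings.

```python
def causal_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives a global causal mask; otherwise each query only sees
    itself and the previous `window - 1` positions.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


global_mask = causal_mask(4)
local_mask = causal_mask(4, window=2)
print(attended_pairs(global_mask), attended_pairs(local_mask))
```

In a 5:1 layout, five of every six layers would use the windowed mask, so most layers scale with the window size rather than the full context length.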

Mistral Small 3.1 24B

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

  • Scale
  • 24B parameters
  • Date
  • 2025-03-18
  • Decoder type
  • Dense
  • Attention
  • Standard GQA
  • Key detail
  • Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.

Llama 4 Maverick

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

  • Scale
  • 400B total, 17B active
  • Date
  • 2025-04-05
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA
  • Key detail
  • Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.

Qwen3 235B-A22B

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

  • Scale
  • 235B total, 22B active
  • Date
  • 2025-04-28
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA with QK-Norm
  • Key detail
  • High-capacity MoE design optimized for serving efficiency without a shared expert.

Qwen3 32B

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

  • Scale
  • 32B parameters
  • Date
  • 2025-04-28
  • Decoder type
  • Dense
  • Attention
  • GQA with QK-Norm
  • Key detail
  • Reference dense Qwen stack with QK-Norm and 8 KV heads.
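
QK-Norm normalizes the query and key vectors before their dot product. A minimal sketch, assuming RMSNorm without the learnable gain that real implementations typically include, shows the effect: the attention logit stops depending on the raw magnitude of the query. The vectors are made-up toy values.

```python
import math


def rms_norm(v, eps=1e-6):
    """RMSNorm without a learnable gain (simplification)."""
    rms = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / rms for t in v]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for one query/key pair of one head."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 1.0]
big_q = [100.0 * t for t in q]  # same direction, 100x the magnitude

plain_small = attention_logit(q, k, use_qk_norm=False)
plain_big = attention_logit(big_q, k, use_qk_norm=False)
normed_small = attention_logit(q, k, use_qk_norm=True)
normed_big = attention_logit(big_q, k, use_qk_norm=True)
print(plain_small, plain_big, normed_small, normed_big)
```

Without normalization the logit grows 100x with the query; with QK-Norm the two logits are nearly identical, which keeps the softmax input in a bounded range.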

Qwen3 4B

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

  • Scale
  • 4B parameters
  • Date
  • 2025-04-28
  • Decoder type
  • Dense
  • Attention
  • GQA with QK-Norm
  • Key detail
  • Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.

Qwen3 8B

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

  • Scale
  • 8B parameters
  • Date
  • 2025-04-28
  • Decoder type
  • Dense
  • Attention
  • GQA with QK-Norm
  • Key detail
  • Reference Qwen3 dense stack with QK-Norm and 8 KV heads.

SmolLM3 3B

Compact dense model that experiments with leaving out positional encodings in selected layers.

  • Scale
  • 3B parameters
  • Date
  • 2025-06-19
  • Decoder type
  • Dense
  • Attention
  • GQA with periodic NoPE layers
  • Key detail
  • Every fourth layer omits RoPE to test a NoPE-style cadence.
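
The "every fourth layer" cadence can be written as a simple per-layer schedule. Which layer inside each group of four drops RoPE is an assumption here (the last one, with 0-based indexing), and the eight-layer depth is only a toy value.

```python
def uses_rope(layer_idx: int, nope_every: int = 4) -> bool:
    """True if this layer applies RoPE; every `nope_every`-th layer skips it."""
    # Assumption: the last layer of each group of `nope_every` is the NoPE layer.
    return (layer_idx + 1) % nope_every != 0


schedule = ["RoPE" if uses_rope(i) else "NoPE" for i in range(8)]
print(schedule)
```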

Kimi K2

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

  • Scale
  • 1T total, 32B active
  • Date
  • 2025-07-10
  • Decoder type
  • Sparse MoE
  • Attention
  • MLA
  • Key detail
  • More experts and fewer MLA heads than DeepSeek V3.
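
The scale lines of the two fact sheets allow a quick sparsity comparison. The snippet below computes the share of parameters active per token, reading "1T" as 1000B; it uses only the totals quoted above and says nothing about expert counts.

```python
def active_percent(total_b: float, active_b: float) -> float:
    """Percentage of parameters used per token (both values in billions)."""
    return 100.0 * active_b / total_b


kimi_k2 = active_percent(1000, 32)     # 1T total, 32B active
deepseek_v3 = active_percent(671, 37)  # 671B total, 37B active
print(f"Kimi K2: {kimi_k2:.1f}%  DeepSeek V3: {deepseek_v3:.1f}%")
```

By this measure Kimi K2 activates a smaller share of its weights per token than DeepSeek V3.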

GLM-4.5 355B

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

  • Scale
  • 355B total, 32B active
  • Date
  • 2025-07-28
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA with QK-Norm
  • Key detail
  • Starts with three dense layers before MoE routing and keeps a shared expert.

GPT-OSS 120B

Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

  • Scale
  • 120B parameters
  • Date
  • 2025-08-04
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA with alternating sliding-window and global layers
  • Key detail
  • Shared architectural template scaled up for OpenAI's flagship open-weight release.

GPT-OSS 20B

OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.

  • Scale
  • 20B total, 3.6B active
  • Date
  • 2025-08-04
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA with alternating sliding-window and global layers
  • Key detail
  • Wider and shallower than Qwen3, with attention bias and sink mechanisms.
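
One way to picture an attention sink is as an extra logit that joins the softmax denominator but contributes no value vector, so a head can assign part of its probability mass to "nothing". The sketch below follows that formulation; whether it matches GPT-OSS in every detail is an assumption, and `sink_logit` is a hypothetical stand-in for a learned per-head parameter.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over `scores` with an extra sink term in the denominator only."""
    m = max(max(scores), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


weights = softmax_with_sink([0.0, 0.0], sink_logit=0.0)
print(weights, sum(weights))
```

With two equal scores and an equal sink logit, each real token gets one third of the mass and the remaining third is absorbed by the sink, so the weights sum to less than one.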

Grok 2.5 270B

Rare production-model release that shows an older MoE style with fewer, larger experts.

  • Scale
  • 270B parameters
  • Date
  • 2025-08-22
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA
  • Key detail
  • Adds an always-on SwiGLU path that effectively behaves like a shared expert.

Qwen3 Next 80B-A3B

Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.

  • Scale
  • 80B total, 3B active
  • Date
  • 2025-09-09
  • Decoder type
  • Sparse hybrid
  • Attention
  • 3:1 Gated DeltaNet and Gated Attention
  • Key detail
  • Adds many more experts, a shared expert, and a native 262k context.

MiniMax M2 230B

MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.

  • Scale
  • 230B total, 10B active
  • Date
  • 2025-10-23
  • Decoder type
  • Sparse MoE
  • Attention
  • GQA with QK-Norm and partial RoPE
  • Key detail
  • Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.
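
Partial RoPE rotates only part of each head's dimensions and leaves the rest position-free. A minimal sketch, assuming adjacent dimensions are paired and the first `rotary_dim` dimensions are the rotated ones (pairing conventions differ between implementations, and the sizes here are toy values):

```python
import math


def partial_rope(vec, pos, rotary_dim, base=10000.0):
    """Apply rotary position encoding to the first `rotary_dim` entries only."""
    if rotary_dim % 2 != 0 or rotary_dim > len(vec):
        raise ValueError("rotary_dim must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rotary_dim, 2):
        # Standard RoPE frequency for the pair starting at dimension i.
        theta = pos / (base ** (i / rotary_dim))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out


v = [1.0, 0.0, 5.0, 6.0]
rotated = partial_rope(v, pos=3, rotary_dim=2)
print(rotated)
```

The last two entries pass through unchanged, while the first pair is rotated by an angle that depends on the position.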

Kimi Linear 48B-A3B

Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.

  • Scale
  • 48B total, 3B active
  • Date
  • 2025-10-30
  • Decoder type
  • Sparse hybrid
  • Attention
  • 3:1 Kimi Delta Attention and MLA
  • Key detail
  • Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.

OLMo 3 32B

Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.

  • Scale
  • 32B parameters
  • Date
  • 2025-11-20
  • Decoder type
  • Dense
  • Attention
  • GQA with QK-Norm and 3:1 sliding-window/global attention
  • Key detail
  • Keeps post-norm while scaling width and applying YaRN only on global layers.

OLMo 3 7B

New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.

  • Scale
  • 7B parameters
  • Date
  • 2025-11-20
  • Decoder type
  • Dense
  • Attention
  • MHA with QK-Norm and 3:1 sliding-window/global attention
  • Key detail
  • Retains post-norm, keeps MHA, and applies YaRN only on global layers.

DeepSeek V3.2

DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.

  • Scale
  • 671B total, 37B active
  • Date
  • 2025-12-01
  • Decoder type
  • Sparse MoE
  • Attention
  • MLA with DeepSeek Sparse Attention
  • Key detail
  • An evolutio
Source: Hacker News
