NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...5 min read

Welcome Gemma 4: Frontier multimodal intelligence on device

Share
NOW LET US Article – Welcome Gemma 4: Frontier multimodal intelligence on device

Gemma 4 marks a significant leap in open-source AI, offering high-quality multimodal capabilities including audio and images, optimized for seamless deployment across various platforms and devices.

These models are the real deal: truly open with Apache 2 licenses, high quality with pareto frontier arena scores, multimodal including audio, and sizes you can use everywhere including on-device. Gemma 4 builds on advances from previous families and makes them click together. In our tests with pre-release checkpoints we have been impressed by their capabilities, to the extent that we struggled to find good fine-tuning examples because they are so good out of the box.

We collaborated with Google and the community to make them available everywhere: transformers, llama.cpp, MLX, WebGPU, Rust; you name it. This blog post will show you how to build with your favorite tools so let us know what you think!

  • What is New with Gemma 4?
  • Overview of Capabilities and Architecture
  • Multimodal Capabilities
  • Deploy Anywhere
  • Fine-tuning & Demos
  • Try Gemma 4
  • Benchmark Results
  • Acknowledgements

Similar to Gemma-3n, Gemma 4 supports image, text, and audio inputs, and generates text responses. The text decoder is based on the Gemma model with support for long context windows. The image encoder is similar to the one from Gemma 3 but with two crucial improvements: variable aspect ratios, and configurable number of image token inputs to find your sweet spot between speed, memory, and quality. All models support images (or video) and text inputs, while the small variants (E2B and E4B) support audio as well.

Gemma 4 comes in four sizes, all base and instruction fine-tuned:

| Model | Parameter Size | Context Window | Checkpoints | |---|---|---|---| | Gemma 4 E2B | 2.3B effective, 5.1B with embeddings | 128k | base, IT | | Gemma 4 E4B | 4.5B effective, 8B with embeddings | 128k | base, IT | | Gemma 4 31B | 31B dense model | 256K | base, IT | | Gemma 4 26B A4B | mixture-of-experts with 4B activated/26B total parameters | 256K | base, IT |

Gemma 4 leverages several architecture components used in previous Gemma versions and other open models, and leaves out complex or inconclusive features such as Altup. The combination is a mix designed to be highly compatible across libraries and devices, that can efficiently support long context and agentic use cases, whilst being ideal for quantization.

As shown in the benchmarks above, this feature mix (combined with the training data and recipe) enables the 31B dense model to achieve an estimated LMArena score (text only) of 1452, while the 26B MoE reaches 1441 with just 4B active parameters 🤯. As we'll see, multimodal operation is comparatively as good as text generation, at least in informal and subjective tests.

These are the main architecture characteristics in Gemma 4:

  • Alternating local sliding-window and global full-context attention layers. Smaller dense models use sliding windows of 512 tokens while larger models use 1024 tokens.
  • Dual RoPE configurations: standard RoPE for sliding layers, proportional RoPE for global layers, to enable longer context.
  • Per-Layer Embeddings (PLE): a second embedding table that feeds a small residual signal into every decoder layer.
  • Shared KV Cache: the last N layers of the model reuse key-value states from earlier layers, eliminating redundant KV projections.
  • Vision encoder: uses learned 2D positions and multidimensional RoPE. Preserves the original aspect ratios and can encode images to a few different token budgets (70, 140, 280, 560, 1120).
  • Audio encoder: USM-style conformer with the same base architecture as the one in Gemma-3n.

One of the most distinctive features in smaller Gemma 4 models is Per-Layer Embeddings (PLE), which was introduced previously in Gemma-3n. In a standard transformer, each token gets a single embedding vector at input, and the same initial representation is what the residual stream builds on across all layers, forcing the embedding to frontload everything the model might need. PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream. For each token, it produces a small dedicated vector for every layer by combining two signals: a token-identity component (from an embedding lookup) and a context-aware component (from a learned projection of the main embeddings). Each decoder layer then uses its corresponding vector to modulate the hidden states via a lightweight residual block after attention and feed-forward. This gives each layer its own channel to receive token-specific information only when it becomes relevant, rather than requiring everything to be packed into a single upfront embedding. Because the PLE dimension is much smaller than the main hidden size, this adds meaningful per-layer specialization at modest parameter cost. For multimodal inputs (images, audio, video), PLE is computed before soft tokens are merged into the embedding sequence — since PLE relies on token IDs that are lost once multimodal features replace the placeholders. Multimodal positions use the pad token ID, effectively receiving neutral per-layer signals.

The shared KV cache is an efficiency optimization that reduces both compute and memory during inference. The last num_kv_shared_layers layers of the model don't compute their own key and value projections. Instead, they reuse the K and V tensors from the last non-shared layer of the same attention type (sliding or full).

In practice, this has a minimal impact on quality while being much more efficient (in terms of both memory and compute) for long context generation and on-device use.

We saw in our tests that Gemma 4 supports comprehensive multimodal capabilities out of the box. We don't know what was the training mix, but we had success using it for tasks such as OCR, speech-to-text, object detection, or pointing. It also supports text-only and multimodal function calling, reasoning, code completion and correction.

Here, we show a few inference examples across different model sizes. You can run them conveniently with this notebook. We encourage you to try the demos and share them below this blog!

We test Gemma 4 on GUI element detection and pointing across different sizes, with the following image and text prompt: "What's the bounding box for the "view recipe" element in the image?"

With this prompt, the model natively responds in JSON format with the detected bounding boxes - no need for specific instructions or grammar-constrained generation. We found the coordinates refer to an image size of 1000x1000, relative to the input dimensions.

We visualize the outputs below for your convenience. We parse the bounding boxes from the returned JSON: json\n[\n {"box_2d": [171, 75, 245, 308], "label": "view recipe element"}\n]\n

We test models to detect everyday objects, here we ask them to detect the bike and compare different model outputs. As in the previous case, we parse the bounding box from the json and translate to image space coordinates.

We asked Gemma 4 to write HTML code to reconstruct a page we made with Gemini 3. Below you can find the code to do this, we enable thinking and ask each model to generate up to 4000 new tokens, to make it foolproof.

Smaller Gemma 4 models can take in videos with audio while larger ones can take in videos without audio. While the models are not explicitly post-trained on videos, they can understand.

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog

Advertisement
Ad slot ready: 5887729102

More in this category

NOW LET US Related – GLM 5.2 Is Out

dev-tools

GLM 5.2 Is Out

Zhipu AI has officially released GLM-5.2, its most powerful open-source model to date, featuring a 1M context window and advanced long-horizon task capabilities. The release underscores Zhipu's commitment to open-source AI and global scientific collaboration amid rising technological restrictions.

NOW LET US Related – Treating pancreatic tumours may have revealed cancer's master switch

dev-tools

Treating pancreatic tumours may have revealed cancer's master switch

A promising new drug called daraxonrasib has shown breakthrough results in treating pancreatic cancer, doubling median survival times. This achievement could pave the way for an entirely new class of cancer treatments.

NOW LET US Related – Leaving Mozilla

dev-tools

Leaving Mozilla

A poignant and candid reflection from a 15-year Mozilla veteran upon their departure. The author highlights the leadership's missteps in trying to emulate tech giants and urges Mozilla to return to its core values: community and uniqueness.

NOW LET US Related – Shepherd's Dog: A Game by the Most Dangerous AI Model

dev-tools

Shepherd's Dog: A Game by the Most Dangerous AI Model

A developer tested Anthropic's latest, supposedly 'too dangerous' AI model by asking it to build a long-held game idea in a single shot. The model succeeded, generating a complete 2,319-line game after a 45-minute reasoning session.

NOW LET US Related – Open source AI must win

dev-tools

Open source AI must win

If artificial intelligence becomes a utility rented only from a few closed institutions, humanity loses its operational freedom. Open-source AI is a vital infrastructure for the future of our digital society.

NOW LET US Related – Statement on US government directive to suspend access to Fable 5 and Mythos 5

dev-tools

Statement on US government directive to suspend access to Fable 5 and Mythos 5

The US government has issued an export control directive forcing Anthropic to suspend all access to its Fable 5 and Mythos 5 models due to national security concerns, a move the AI safety startup strongly disputes.

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.

Welcome Gemma 4: Frontier multimodal intelligence on device | Gemma 4 chính thức ra mắt: Đỉnh cao AI đa phương thức, tối ưu cho mọi thiết bị | Now Let Us | NOW LET US