NOW LET US – AI RAG SaaS Studio TP.HCM

Hugging Face and Cerebras bring Gemma 4 to real-time voice AI


Hugging Face and Cerebras have partnered to power a real-time speech-to-speech pipeline built on Google's Gemma 4 31B. By running the model on Cerebras's fast inference hardware, the system removes the language-model latency bottleneck, enabling natural, human-like voice interactions for robots and virtual assistants.


The result is a speech-to-speech experience that feels dramatically more natural. Instead of waiting for an AI to respond, conversations flow with the responsiveness users expect from human interaction.

The demo is built as a real-time speech-to-speech pipeline. Each part of the system is modular, open, and replaceable, making it easy for developers to adapt the stack for different assistants, robots, products, or research projects.

This creates a fully open speech-to-speech loop:

Speech input
-> speech recognition with Nvidia's Parakeet
-> Gemma 4 VLM inference on Cerebras
-> text-to-speech with Alibaba's Qwen3TTS
-> spoken response
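
The loop above can be sketched as three swappable stages. This is a minimal illustration of the modular design, not the demo's actual code: the class name, the stage signatures, and the stand-in functions are all assumptions, and no real Parakeet, Gemma 4, or Qwen model is called.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the demo's real interfaces are not
# described in this article.
Transcriber = Callable[[bytes], str]   # speech recognition: audio -> text
Responder = Callable[[str], str]       # language model: text -> reply text
Synthesizer = Callable[[str], bytes]   # text-to-speech: text -> audio


@dataclass
class SpeechToSpeechPipeline:
    """Chains the three stages; any stage can be replaced independently."""
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def run(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)
        reply = self.respond(text_in)
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any model or network access.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechToSpeechPipeline(fake_asr, fake_llm, fake_tts)
print(pipeline.run(b"hello"))
```

Because each stage is just a callable, replacing the language model or the speech synthesizer means passing a different function to the constructor, and the rest of the loop stays unchanged.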

The architecture brings together the strengths of the open-source AI ecosystem: Nvidia's Parakeet for speech recognition, Google DeepMind’s Gemma 4 31B for the language model, Cerebras for fast inference, and Qwen for text-to-speech. Every layer can be inspected, modified, and extended by developers.

Today, some production systems see a reasonable median latency while still experiencing frustrating multi-second delays at the 95th percentile (P95). Those delays become even more noticeable when tool calls or multimodal steps require multiple turns.
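
To see how a healthy median can hide a slow tail, here is a small sketch that computes both figures from a list of response times. The sample values are invented for illustration, not measurements from the demo, and the nearest-rank method is one common way to define a percentile, assumed here for simplicity.

```python
import statistics


def latency_summary(samples_ms):
    """Return (median, P95) for a list of latencies in milliseconds.

    P95 uses the nearest-rank method: the smallest sample such that at
    least 95% of all samples are less than or equal to it.
    """
    ordered = sorted(samples_ms)
    median = statistics.median(ordered)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    p95 = ordered[rank - 1]
    return median, p95


# Invented example: 18 fast responses and 2 slow ones out of 20.
samples = [300] * 18 + [2500] * 2
median_ms, p95_ms = latency_summary(samples)
print(f"median={median_ms} ms, p95={p95_ms} ms")
```

In this invented sample the median stays at 300 ms while the P95 is 2500 ms, so one conversation turn in ten carries a multi-second pause even though the typical response looks fast.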

Cerebras helps solve one of the most important bottlenecks in the stack: the language-model response time. By making inference dramatically faster and more stable, Cerebras allows the rest of the Hugging Face pipeline to shine.

That stability is especially important at the long tail. Many systems can deliver acceptable median response times, but occasional slow responses still make conversations feel unreliable.

This same Hugging Face speech-to-speech pipeline already powers Reachy Mini robots, with more than 9,000 robots in the wild. For robots, voice assistants, and embodied AI, responsiveness is not a cosmetic improvement. It is what makes the interaction feel alive.

The motivation to use Cerebras is therefore not simply cost reduction. It is low latency, predictable performance, and the ability to create real-time experiences that feel natural at scale.

This collaboration reflects a shared belief that the future of AI will be both open and performant. Open-source models, open infrastructure, and breakthrough inference speed together create a foundation for the next generation of conversational AI.

We invite developers to explore the demo, experiment with the code, and help shape what comes next for real-time voice AI.

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog
