NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...4 min read

Cohere Transcribe: Speech Recognition

Share
NOW LET US Article – Cohere Transcribe: Speech Recognition

Cohere announces Transcribe, a state-of-the-art open-source ASR model that currently ranks #1 for accuracy on HuggingFace’s Open ASR Leaderboard, outperforming models like Whisper Large v3.

Cohere is announcing Transcribe, a state-of-the-art automatic speech recognition (ASR) model that is open source and available today for download.

Speech is rapidly becoming a core modality for AI-enabled workloads and automations — from meeting transcription and speech analytics to real-time customer support agents.

Our objective was straightforward: push the frontier of dedicated ASR model accuracy under practical conditions. The model was trained from scratch with a deliberate focus on minimizing word error rate (WER), while keeping production readiness top-of-the-mind. In other words, not just a research artifact, but a system designed for everyday use.

Cohere Transcribe reflects that intent. It is available for open-source use with full infrastructure control, maintains a manageable inference footprint suitable for practical GPU and local utilization, delivers best-in-class serving efficiency, and is also available via Model Vault — Cohere’s secure, fully managed model inference platform.

Cohere Transcribe currently ranks #1 for accuracy on HuggingFace’s Open ASR Leaderboard, setting a new benchmark for real-world transcription performance.

This marks our zero-to-one in bringing high-performance speech recognition into enterprise AI workflows. Read on to learn more.

Model overview

| Name | cohere-transcribe-03-2026 | |---|---| | Architecture | conformer-based encoder-decoder | | Input | audio waveform → log-Mel spectrogram | | Output | transcribed text | | Model size | 2B | | Model | a large Conformer encoder extracts acoustic representations, followed by a lightweight Transformer decoder for token generation | | Training objective | standard supervised cross-entropy on output tokens; trained from scratch | | Languages | trained on 14 languages | | License | Apache 2.0 |

Model performance

Accuracy

Cohere Transcribe is the latest standard for English speech recognition accuracy. It leads the HuggingFace Open ASR Leaderboard with an average word error rate of just 5.42%, outperforming all open- and closed-source dedicated ASR alternatives, including Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B. This captures the model’s versatile capability across real-world speech tasks, such as robustness to multiple-speaker environments, boardroom-style acoustics (e.g. AMI dataset), and diverse accents (e.g. Voxpopuli dataset).

| Model | Average WER | AMI | Earnings 22 | Gigaspeech | LS clean | LS other | SPGISpeech | Tedlium | Voxpopuli | |---|---|---|---|---|---|---|---|---|---| | Cohere Transcribe | 5.42 | 8.13 | 10.86 | 9.34 | 1.25 | 2.37 | 3.08 | 2.49 | 5.87 | | Zoom Scribe v1 | 5.47 | 10.03 | 9.53 | 9.61 | 1.63 | 2.81 | 1.59 | 3.22 | 5.37 | | IBM Granite 4.0 1B Speech | 5.52 | 8.44 | 8.48 | 10.14 | 1.42 | 2.85 | 3.89 | 3.10 | 5.84 | | NVIDIA Canary Qwen 2.5B | 5.63 | 10.19 | 10.45 | 9.43 | 1.61 | 3.10 | 1.90 | 2.71 | 5.66 | | Qwen3-ASR-1.7B | 5.76 | 10.56 | 10.25 | 8.74 | 1.63 | 3.40 | 2.84 | 2.28 | 6.35 | | ElevenLabs Scribe v2 | 5.83 | 11.86 | 9.43 | 9.11 | 1.54 | 2.83 | 2.68 | 2.37 | 6.80 | | Kyutai STT 2.6B | 6.40 | 12.17 | 10.99 | 9.81 | 1.70 | 4.32 | 2.03 | 3.35 | 6.79 | | OpenAI Whisper Large v3 | 7.44 | 15.95 | 11.29 | 10.02 | 2.01 | 3.91 | 2.94 | 3.86 | 9.54 | | Voxtral Mini 4B Realtime 2602 | 7.68 | 17.07 | 11.84 | 10.38 | 2.08 | 5.52 | 2.42 | 3.79 | 8.34 |

Critically, these gains aren’t limited to benchmark datasets. We see the same state-of-the-art performance carried over into human evaluations, where trained reviewers assess transcription quality across real-world audio for accuracy, coherence, and usability. Consistency across both evaluation methods reinforces that Cohere Transcribe’s performance translates reliably from controlled tests to practical enterprise settings.

Throughput

In production settings, ASR systems must operate under strict latency and throughput constraints; even if accurate, slow or resource-intensive transcription can directly impact user experience, operational efficiency, and cost.

Transcribe extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while sustaining best-in-class throughput (high RTFx) within the 1B+ parameter model cohort.

“We’re genuinely impressed with what Cohere has built with Transcribe. The speed is exceptional — turning minutes of audio into usable transcripts in seconds — and it immediately unlocks new possibilities for real-time products and workflows.

In our testing, the model handled everyday speech very well and delivered strong, reliable transcription quality. The overall experience has been smooth and easy to work with. We’re excited to be partnering with Cohere and to continue exploring what we can build with this technology.” Paige Dickie, Vice-President Radical Ventures

Zero to one, and beyond.

We are working towards deeper integration of Cohere Transcribe with North, Cohere’s AI agent orchestration platform. With planned updates, Cohere Transcribe will evolve from a high-accuracy transcription model into a broader foundation for enterprise speech intelligence.

Getting started.

Cohere Transcribe is now available for download on Hugging Face. Follow the setup instructions to run the model locally, or even in edge environments.

You can also access Cohere Transcribe via our API for free, low-setup experimentation subject to rate limits. See the documentation for usage details and integration guidance.

For production deployment without rate limits, provision a dedicated Model Vault. This enables low-latency, private cloud inference without having to manage infrastructure. Pricing is calculated per hour-instance, with discounted plans for longer-term commitments. Contact our team to discuss your requirements.

© 2026 Now Let Us. All rights reserved.

Source: Hacker News

Advertisement
Ad slot ready: 5887729102

More in this category

NOW LET US Related – GLM 5.2 Is Out

dev-tools

GLM 5.2 Is Out

Zhipu AI has officially released GLM-5.2, its most powerful open-source model to date, featuring a 1M context window and advanced long-horizon task capabilities. The release underscores Zhipu's commitment to open-source AI and global scientific collaboration amid rising technological restrictions.

NOW LET US Related – Noise infusion banned from statistical products published by Census Bureau

dev-tools

Noise infusion banned from statistical products published by Census Bureau

The U.S. Department of Commerce has banned "noise infusion" from statistical products published by the Census Bureau, a decision that could have severe consequences for both data utility and privacy protection.

NOW LET US Related – Treating pancreatic tumours may have revealed cancer's master switch

dev-tools

Treating pancreatic tumours may have revealed cancer's master switch

A promising new drug called daraxonrasib has shown breakthrough results in treating pancreatic cancer, doubling median survival times. This achievement could pave the way for an entirely new class of cancer treatments.

NOW LET US Related – Every Frame Perfect

dev-tools

Every Frame Perfect

In UI design, perfection isn't just about the start and end states, but every single transition frame in between. Polishing these micro-interactions is key to building user trust.

NOW LET US Related – Leaving Mozilla

dev-tools

Leaving Mozilla

A poignant and candid reflection from a 15-year Mozilla veteran upon their departure. The author highlights the leadership's missteps in trying to emulate tech giants and urges Mozilla to return to its core values: community and uniqueness.

NOW LET US Related – Shepherd's Dog: A Game by the Most Dangerous AI Model

dev-tools

Shepherd's Dog: A Game by the Most Dangerous AI Model

A developer tested Anthropic's latest, supposedly 'too dangerous' AI model by asking it to build a long-held game idea in a single shot. The model succeeded, generating a complete 2,319-line game after a 45-minute reasoning session.

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.