Building a Fast Multilingual OCR Model with Synthetic Data

NVIDIA introduces Nemotron OCR v2, a high-performance multilingual model trained on 12 million synthetic images. By programmatically generating data, the model achieves state-of-the-art accuracy and speed across diverse languages and layouts.

Synthetic data generation offers a way out of the usual tradeoff between scale and label quality. By rendering text onto images programmatically, we get both the scale of web scraping and the label purity of hand annotation. Every bounding box, transcription, and reading order relationship is known exactly because we placed it there, and we have full control over which layouts, font styles, and edge cases appear in the training set. The challenge is realism. Simulating diverse layouts and realistic document scenarios is difficult, but with the right rendering engine and strong randomization across fonts, colors, backgrounds, augmentations, and layout structures, it is possible to build enough invariance that models trained on synthetic data generalize well to real-world documents.
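To make "known exactly because we placed it there" concrete, here is a minimal sketch of label generation by construction. It is not the actual rendering pipeline: `Box`, `layout_page`, and the fixed-width glyph geometry (`char_w`, `line_h`, `margin`) are hypothetical names and simplifications chosen for illustration. A real renderer would measure glyphs with an actual font.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Box:
    """An axis-aligned box with the text it contains (hypothetical label type)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def layout_page(text: str, page_width: int = 200, char_w: int = 10,
                line_h: int = 20, margin: int = 10) -> dict:
    """Place words left to right and wrap at the right margin.

    Assumes every glyph is `char_w` pixels wide, which is a simplification.
    Returns word boxes, line boxes, and reading-order edges between lines.
    """
    rows: list[list[Box]] = [[]]
    x, y = margin, margin
    for word in text.split():
        width = len(word) * char_w
        # Wrap when the word would cross the right margin, unless the row is empty.
        if rows[-1] and x + width > page_width - margin:
            rows.append([])
            x, y = margin, y + line_h
        rows[-1].append(Box(word, x, y, x + width, y + line_h))
        x += width + char_w  # one glyph of spacing between words

    rows = [row for row in rows if row]
    words = [box for row in rows for box in row]
    # A line box spans from its first word's left edge to its last word's right edge.
    lines = [
        Box(" ".join(b.text for b in row), row[0].x0, row[0].y0, row[-1].x1, row[-1].y1)
        for row in rows
    ]
    # Reading order as directed edges between consecutive line indices.
    reading_order = [(i, i + 1) for i in range(len(lines) - 1)]
    return {"words": words, "lines": lines, "reading_order": reading_order}


page = layout_page("synthetic pages carry exact labels")
for line in page["lines"]:
    print(line)
```

Because the generator decides where each word goes, the transcription, the boxes at each level, and the reading order all come out of the same loop with no annotation step.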

Using this approach, we built Nemotron OCR v2, a multilingual OCR model that is both accurate and fast. Accuracy is driven by data: training on 12 million synthetic images across six languages brought NED scores on non-English languages down from 0.56–0.92 to 0.035–0.069. Speed is driven by architecture: a shared detection backbone whose features are reused by both the recognizer and the relational model eliminates redundant computation and enables 34.7 pages/second on a single A100 GPU. The synthetic data pipeline is generic enough to extend to any language for which fonts and source text exist.

Nemotron OCR v1 was a strong English OCR model, but it was not trained on other languages, so it failed to read non-English documents accurately. On our SynthDoG benchmark, v1 produced Normalized Edit Distance (NED) scores between 0.56 and 0.92 for Japanese, Korean, Russian, and Chinese, where 0 is a perfect match. At these error rates, the model output bears little resemblance to the ground truth.

| Language | Nemotron OCR v1 NED |
|---|---|
| Japanese | 0.723 |
| Korean | 0.923 |
| Russian | 0.564 |
| Chinese (Simplified) | 0.784 |
| Chinese (Traditional) | 0.700 |
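For readers unfamiliar with the metric, the sketch below shows one common way to compute NED: the Levenshtein distance between prediction and reference, divided by the length of the longer string. The normalization choice is an assumption here, since the benchmark's exact definition is not spelled out above, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; assumes normalization by the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings match perfectly
    return levenshtein(prediction, reference) / longest


print(normalized_edit_distance("kitten", "sitting"))
```

Under this definition, a score of 0.923 means that almost every character would have to be edited to turn the prediction into the ground truth.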

Part of the issue was the character set. The v1 model supported only 855 characters, which simply did not cover CJK (Chinese, Japanese, Korean) or Cyrillic scripts. We ran an experiment where we expanded the character set to 14,244 characters to cover all the target languages. This helped slightly, but without sufficient training data actually containing those characters, the improvement was marginal. The model could theoretically output the right characters, but it had never learned what they looked like. The bottleneck was data, not architecture.
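A quick way to see the character set gap is to measure how much of a text sample a given character set can represent at all. The sketch below is illustrative only: `charset_coverage` and the tiny `latin_only` set are hypothetical stand-ins, not the model's actual 855-character vocabulary.

```python
from __future__ import annotations

from collections import Counter


def charset_coverage(text: str, charset: set[str]) -> tuple[float, Counter]:
    """Return the fraction of non-whitespace characters in `charset` and the misses."""
    chars = [ch for ch in text if not ch.isspace()]
    missing = Counter(ch for ch in chars if ch not in charset)
    if not chars:
        return 1.0, missing  # nothing to cover
    covered = len(chars) - sum(missing.values())
    return covered / len(chars), missing


# Hypothetical stand-in for a Latin-centric output vocabulary.
latin_only = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,")

coverage, missing = charset_coverage("OCR は難しい", latin_only)
print(f"coverage={coverage:.2f}, missing={dict(missing)}")
```

Full coverage is necessary but not sufficient: as the experiment above showed, a model also needs training images that contain those characters.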

Collecting and annotating millions of real-world images across six languages with word-, line-, and paragraph-level bounding boxes plus reading order graphs would be prohibitively expensive. We needed a different approach. Our key insight is that the recipe for multilingual OCR training data is fundamentally language-agnostic. You need source text in the target language and fonts that can render that language's script.

We built our pipeline on a heavily modified version of SynthDoG. We extended it with multi-level bounding boxes, relation graphs for reading order, and diverse layout modes covering multi-column text, tables, and vertical text. For CJK languages, we moved to line-level recognition. Each rendered page goes through randomized augmentations including noise, blur, and color shifting to improve generalization. The full dataset contains 12.2 million samples across six languages.
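The augmentation step can be illustrated with a small, dependency-free sketch. It is an assumption-laden toy, not the pipeline's implementation: images are non-empty grayscale grids of floats in [0, 1], "color shifting" is reduced to a brightness shift, and the function names, probabilities, and parameter ranges are all hypothetical.

```python
from __future__ import annotations

import random


def add_noise(img: list[list[float]], rng: random.Random, sigma: float = 0.05) -> list[list[float]]:
    """Add Gaussian noise to every pixel and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def box_blur(img: list[list[float]]) -> list[list[float]]:
    """Replace each pixel with the mean of its 3x3 neighbourhood, clipped at the edges."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neighbours = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        out.append(row)
    return out


def shift_intensity(img: list[list[float]], delta: float) -> list[list[float]]:
    """Shift brightness by `delta` and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in img]


def augment(img: list[list[float]], seed: int) -> list[list[float]]:
    """Apply a randomized blur, brightness shift, and noise; reproducible per seed."""
    rng = random.Random(seed)
    if rng.random() < 0.5:  # blur only some of the samples
        img = box_blur(img)
    img = shift_intensity(img, rng.uniform(-0.1, 0.1))
    return add_noise(img, rng)


blank_page = [[1.0] * 4 for _ in range(3)]
augmented = augment(blank_page, seed=0)
print(len(augmented), len(augmented[0]))
```

These transforms change pixel values but not geometry, so the bounding boxes and reading order produced at render time remain valid labels for the augmented image.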

Source: Hugging Face Blog