DEV-TOOLS | April 16, 2026 | 1 min read

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

A guide to finetuning multimodal models for specialized tasks like Visual Document Retrieval, showing gains over the base model and over existing models up to 4x its size.

This guide shows how to train or finetune multimodal embedding and reranker models on your own data.

As a practical example, I'll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting tomaarsen/Qwen3-VL-Embedding-2B-vdr demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model's 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size.
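To make the NDCG@10 figures above more concrete, here is a minimal sketch of how the metric can be computed for a single query. The evaluation setup is not described here, so the sketch assumes binary relevance labels and a linear gain; the function names are illustrative, not taken from any library.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # The result at 0-based position `rank` is discounted by log2(rank + 2),
    # so the top result has a discount of log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    `ranked_relevances` holds the relevance label of every candidate,
    in the order the model ranked them (best first).
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidate exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page per query (an assumption for this example).
top_hit = ndcg_at_k([1, 0, 0])             # relevant page ranked first
second_hit = ndcg_at_k([0, 1, 0])          # relevant page ranked second
missed = ndcg_at_k([0] * 10 + [1])         # relevant page outside the top 10

print(top_hit, round(second_hit, 4), missed)
```

A reported score such as 0.947 is typically the mean of this per-query value over all evaluation queries, so it rewards placing the relevant page as close to rank 1 as possible.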

General-purpose multimodal embedding models like Qwen/Qwen3-VL-Embedding-2B are trained on diverse data to perform well across a wide range of languages and tasks. But this generality means the model is rarely the best choice for any specific task. By finetuning on domain-specific data, the model can learn specialized patterns like document layouts and charts.

Training multimodal Sentence Transformer models involves the same components as training text-only models: Model, Dataset, Loss Function, Training Arguments, Evaluator, and Trainer. The key difference is that your datasets contain images alongside text, and the model's processor handles the image preprocessing automatically.
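The components listed above might be wired together roughly as follows. This is a sketch under several assumptions: an installed Sentence Transformers release with multimodal support, a `train_dataset` whose first column holds text queries and whose second column holds the matching page images, and MultipleNegativesRankingLoss chosen purely for illustration (the text does not say which loss was used). The `build_trainer` helper, the output directory, and the hyperparameter values are hypothetical placeholders, and the optional evaluator is omitted.

```python
def build_trainer(
    train_dataset,
    eval_dataset=None,
    model_name="Qwen/Qwen3-VL-Embedding-2B",
    output_dir="models/qwen3-vl-embedding-2b-vdr",  # hypothetical path
):
    """Assemble model, loss, training arguments, and trainer for finetuning."""
    # Imported lazily so this sketch can be imported without the heavy
    # dependencies installed; they are only needed when the function runs.
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    # Model: load the base checkpoint named in the text.
    model = SentenceTransformer(model_name)

    # Loss: treats each (query, page) row as a positive pair and the other
    # pages in the batch as negatives. Dataset columns are consumed in order.
    loss = MultipleNegativesRankingLoss(model)

    # Training arguments: placeholder values, not the author's settings.
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
    )

    # Trainer: ties the components together.
    return SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
    )
```

Calling `build_trainer(my_dataset).train()` would then start finetuning, where `my_dataset` stands in for your own query and page-image pairs.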

You can finetune an existing multimodal embedding model or start from a fresh VLM checkpoint. Alternatively, you can use the Router module to compose separate encoders for different modalities, which is useful when you want to use lightweight, specialized encoders rather than a large VLM.
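The idea of composing separate encoders per modality can be illustrated with a toy dispatcher. This is not the library's Router API: `ModalityRouter`, the modality detection rule, and both toy encoders are hypothetical stand-ins that only show the routing concept, with each encoder mapping its input into the same two-dimensional space.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class ModalityRouter:
    """Toy dispatcher that sends each input to the encoder for its modality."""

    def __init__(self, encoders: Dict[str, Callable[[object], Vector]]):
        self.encoders = encoders

    @staticmethod
    def detect_modality(item: object) -> str:
        # Assumption for this sketch: strings are text, raw bytes are images.
        if isinstance(item, str):
            return "text"
        if isinstance(item, (bytes, bytearray)):
            return "image"
        raise TypeError(f"Unsupported input type: {type(item).__name__}")

    def encode(self, items: Sequence[object]) -> List[Vector]:
        # Each item is encoded by the encoder registered for its modality;
        # all encoders must emit vectors of the same dimensionality.
        return [self.encoders[self.detect_modality(item)](item) for item in items]


def toy_text_encoder(text: str) -> Vector:
    # Stand-in for a lightweight text encoder.
    return [float(len(text)), 0.0]


def toy_image_encoder(data: bytes) -> Vector:
    # Stand-in for a lightweight image encoder.
    return [0.0, float(len(data))]


router = ModalityRouter({"text": toy_text_encoder, "image": toy_image_encoder})
embeddings = router.encode(["quarterly revenue chart", b"\x89PNG"])
print(embeddings)
```

In a real setup the two encoders would be trained so that a query and its matching page land close together in the shared space, which is what makes cross-modal retrieval possible with small, specialized models.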

Source: Hugging Face Blog
