NOW LET US – AI RAG SaaS Studio TP.HCM

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

A comprehensive guide on finetuning multimodal models for specialized tasks like Visual Document Retrieval, demonstrating significant performance gains over larger base models.

This article shows how to train or finetune multimodal embedding and reranker models on your own data.

As a practical example, I'll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting tomaarsen/Qwen3-VL-Embedding-2B-vdr demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model's 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size.
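To make the NDCG@10 figures above concrete, here is a minimal sketch of how the metric can be computed for a single query. It assumes binary or graded relevance labels listed in ranked order for the full candidate set, and it uses the linear-gain form of DCG; evaluation libraries may use a different gain function or average over many queries. The function names are illustrative and do not come from the original article.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance label of each candidate, in the order the
    model ranked them (assumed to cover every candidate for this query).
    """
    ideal_order = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidates at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page, retrieved first versus retrieved second.
best_case = ndcg_at_k([1, 0, 0, 0])
second_place = ndcg_at_k([0, 1, 0, 0])
```

A score of 1.0 means the relevant pages are ranked as high as possible; pushing the single relevant page from rank 1 to rank 2 already lowers the score to about 0.63, which is why differences such as 0.888 versus 0.947 reflect noticeably better ranking.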

General-purpose multimodal embedding models like Qwen/Qwen3-VL-Embedding-2B are trained on diverse data to perform well across a wide range of languages and tasks. But this generality means the model is rarely the best choice for any specific task. By finetuning on domain-specific data, the model can learn specialized patterns like document layouts and charts.

Training multimodal Sentence Transformer models involves the same components as training text-only models: Model, Dataset, Loss Function, Training Arguments, Evaluator, and Trainer. The key difference is that your datasets contain images alongside text, and the model's processor handles the image preprocessing automatically.
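The six components listed above could be wired together roughly as follows. This is a sketch under several assumptions, not the article's actual training script: it assumes a Sentence Transformers release with multimodal support, a dataset whose columns pair a text query with a document page image, and `MultipleNegativesRankingLoss` as the loss (the text above does not name one). The function name, hyperparameters, and output path are placeholders.

```python
def finetune(train_dataset, eval_dataset=None, evaluator=None,
             output_dir="models/vdr-finetuned"):
    """Finetune a multimodal embedding model on (query, image) pairs.

    train_dataset: a datasets.Dataset, assumed to hold one text column and one
    image column per row (for example "query" and "image").
    """
    # Imported here so that defining the function does not load the heavy
    # dependencies until training is actually started.
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    # 1. Model: the base checkpoint named in the article.
    model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

    # 2. Dataset: passed in by the caller (see the docstring).

    # 3. Loss function: in-batch negatives; an assumption, not from the article.
    loss = MultipleNegativesRankingLoss(model)

    # 4. Training arguments: placeholder values, tune for your hardware.
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
    )

    # 5. Evaluator: optional, passed in by the caller.
    # 6. Trainer: ties the components together.
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=evaluator,
    )
    trainer.train()
    model.save_pretrained(output_dir)
    return model
```

Calling `finetune(my_dataset)` would download the 2B-parameter base model and start training, so in practice this needs a GPU with enough memory for the chosen batch size.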

You can finetune an existing multimodal embedding model or start from a fresh VLM checkpoint. Alternatively, you can use the Router module to compose separate encoders for different modalities, which is useful when you want to use lightweight, specialized encoders rather than a large VLM.
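The idea behind composing separate encoders per modality can be illustrated in plain Python. The sketch below is not the Sentence Transformers Router API; it is a hypothetical stand-in that dispatches each input to an encoder chosen by its modality, with toy encoders that map into a shared two-dimensional space. All class and function names here are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Encoder = Callable[[object], Vector]


class ModalityRouter:
    """Hypothetical router: picks one encoder per input modality."""

    def __init__(self, encoders: Dict[str, Encoder]):
        self.encoders = encoders

    @staticmethod
    def detect_modality(item: object) -> str:
        # Toy rule: strings are text, raw bytes stand in for image data.
        if isinstance(item, str):
            return "text"
        if isinstance(item, (bytes, bytearray)):
            return "image"
        raise TypeError(f"Unsupported input type: {type(item).__name__}")

    def encode(self, items: Sequence[object]) -> List[Vector]:
        # Each item goes through the encoder registered for its modality.
        return [self.encoders[self.detect_modality(item)](item) for item in items]


def toy_text_encoder(text: object) -> Vector:
    # Placeholder for a lightweight text encoder.
    return [float(len(str(text))), 0.0]


def toy_image_encoder(data: object) -> Vector:
    # Placeholder for a lightweight image encoder.
    return [0.0, float(len(data))]  # type: ignore[arg-type]


router = ModalityRouter({"text": toy_text_encoder, "image": toy_image_encoder})
embeddings = router.encode(["abc", b"\x00\x01"])
```

In a real setup, each toy function would be replaced by a trained encoder whose outputs share one embedding dimension, so that text queries and image documents can be compared directly.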

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog
