Build a Domain-Specific Embedding Model in Under a Day

With a single GPU and less than a day of training time, you can transform a general-purpose embedding model into one that truly understands your domain, no manual labeling required.

With a single GPU and less than a day of training time, you can transform a general-purpose embedding model into one that truly understands your domain, no manual labeling required. To help you hit the ground running, we are also releasing a ready-to-use synthetic training dataset generated from NVIDIA's public documentation using this exact pipeline. Using this data and the recipe, we saw over 10% improvement in both Recall@10 and NDCG@10. Atlassian applied this recipe to fine-tune on their JIRA dataset, increasing Recall@60 from 0.751 to 0.951, a 26% improvement - on a single GPU.

NeMo Data Designer for synthetic data generation
NeMo Automodel for embedding model training
BEIR for Information retrieval evaluation
NeMo Export-Deploy for ONNX/TensorRT conversion
NVIDIA NIM for production inference serving
A directory of domain documents (text files - .txt, .md, or similar)
A valid NVIDIA API key (free at build.nvidia.com)
NVIDIA Ampere GPU or newer with at least 80GB memory (with Compute Capability >= 8.0)
This tutorial has been tested on 1xA100 (80GB), and 1xH100 (80GB)

By the end of this post, you’ll know how to answer:

📄 Generate training data from domain documents without labeled data

🎯 Use hard negative mining for effective contrastive training

🔗 Improve embedding quality with multi-hop queries

⚙️ Fine-tune a bi-encoder embedding model

📊 Evaluate whether fine-tuning improves retrieval

🚀 Deploy the fine-tuned model in your pipeline

In this tutorial, we will finetune the base model Llama-Nemotron-Embed-1B-v2 - a 1-billion-parameter embedding model that balances quality and inference cost. To get started, follow this setup guide.

Fine-tuning an embedding model requires thousands of (query, relevant document) pairs. Most use cases don’t have this data readily available. Creating it manually is expensive, slow, and often biased by the annotator’s personal interpretation of what’s “relevant.”

Instead of labeling data by hand, you can use an LLM (nvidia/nemotron-3-nano-30b-a3b) to read your documents and automatically generate high-quality synthetic question–answer pairs.

nemotron embed sdg -c default corpus_dir=./data/my_domain_docs

Behind the scenes, this runs a four-stage synthetic data generation (SDG) pipeline powered by NeMo Data Designer:

Source document chunk:

The thermal design power (TDP) of the H100 GPU is 700W in SXM form factor. The cooling solution must maintain junction temperature below 83°C under sustained workloads. Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.

Generated QA pairs:

{
"question": "What cooling approach is recommended when deploying more than 4 H100 GPUs per server node?",
"answer": "Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.",
"query_type": "contextual",
"reasoning_type": "factual",
"question_complexity": 3,
"segment_ids": [1],
"quality_score": 8.5
}

{
"question": "How does the 700W TDP of the H100 SXM constrain the choice between air and liquid cooling in multi-GPU configurations?",
"answer": "The 700W TDP generates substantial heat that must be dissipated to keep junction temperatures below 83°C. In dense configurations exceeding 4 GPUs per node, air cooling in standard 2U chassis cannot handle this thermal load, making liquid cooling necessary.",
"query_type": "multi_hop",
"reasoning_type": "causal",
"question_complexity": 4,
"segment_ids": [1, 2],
"hop_count": 2,
"quality_score": 9.0
}

Notice the difference: the first question is a simple factual lookup. The second requires multi-hop, causal reasoning. The pipeline generates both types, with configurable complexity levels (2–5) and hop counts (1–3). Each QA pair then undergoes quality evaluation, receiving sub-scores for relevance, accuracy, context support, and clarity, along with an overall score. Only pairs that meet the threshold are included in training.

If you train an embedding model with only positive pairs (query + correct document), it learns to distinguish obviously different documents but fails on the hard cases — passages that look relevant but are not the right answer. In a real retrieval system, these near-misses are exactly the documents that cause bad answers. Hard negative mining finds these confusing passages so the model can learn to tell them apart.

nemotron embed prep -c default

The above command runs three sub-steps automatically:

The generated QA pairs are split into training (80%) and test (20%) sets. The test set is formatted as a BEIR-compatible benchmark for standardized evaluation in Step 5.

Using the base embedding model, the pipeline:

Embeds every query and every passage in the corpus.
Computes similarity between each query and all passages.
Masks out each query's labeled positive documents.
Applies a margin filter: any non-positive document scoring above95% of the minimum positive score is eliminated. This exclusion zone guards against false negatives — unlabeled passages that are so close to the positive they may actually be relevant. - From the surviving candidates, selects the top-k highest-scoring documents as hard negatives (5 per query by default).

The result: hard negatives are the most similar non-positive passages that still fall safely below the positive-score ceiling. They are passages the current model considers highly relevant but that are not the labeled answer.

Why this works: Training on easy negatives (completely unrelated passages) teaches the model nothing new. Training on hard negatives forces it to learn the subtle distinctions that matter in your domain. For example, in a medical corpus, a question about "metformin dosage for Type 2 diabetes" might have hard negatives about "metformin side effects" or "insulin dosage for Type 1 diabetes" — close but critically different. The 95% margin ceiling prevents the miner from selecting passages that are too close to the positive, which could actually be correct answers that simply weren't labeled during SDG.

Multi-hop questions reference multiple positive documents. For example, a question like "How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?" has two positive passages.

Unrolling creates one training example per (query, positive document) pair, so the contrastive loss sees each positive independently. A question with 2 positive documents becomes 2 training examples, each with the same hard negatives but a different positive.

The final output is a training-ready JSON file:

{
"question_id": "q42_0",
"question": "How does the 700W TDP of the H100 SXM constrain cooling choices in multi-GPU nodes?",
"pos_doc": [{"id": "d_a1b2c3"}],
"neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}, {"id": "d_p1q2r3"}, {"id": "d_s4t5u6"}, {"id": "d_v7w8x9"}]
}

Standard embedding fine-tuning generates one question per passage and trains the model to match them. This works for simple factual lookups, but real users ask complex questions that span multiple documents or sections. If the model has only seen single-hop training data, it will struggle to retrieve all the relevant passages for these complex queries.

The SDG pipeline generates questions at 1 to 3 hops by default:

1-hop:"What is the TDP of the H100 SXM?" — answered by a single passage.2-hop:"How does the H100's TDP relate to cooling requirements in dense deployments?" — requires connecting information from two passages.3-hop:"Given the TDP, cooling constraints, and rack density limits, what is the maximum number of H100 GPUs deployable in a standard data center row?" — synthesizes three passages.

Each hop is tracked with its own context summary and segment IDs, so the training data preserves the

Source: Hugging Face Blog