Google's Gemini Embedding 2 arrives with native multimodal support to cut costs and speed up your enterprise data stack

Google has announced the public preview of Gemini Embedding 2, a new model with native multimodal support that integrates text, images, video, and audio to reduce latency by up to 70% and lower costs for enterprises.

Yesterday, amid a flurry of enterprise AI product updates, Google announced arguably its most significant one for enterprise customers: the public preview of Gemini Embedding 2, its new embeddings model and a significant evolution in how machines represent and retrieve information across different media types.

While previous embedding models were largely restricted to text, this new model natively integrates text, images, video, audio, and documents into a single numerical space, reducing latency by as much as 70% for some customers and lowering total cost for enterprises that use AI models powered by their own data to complete business tasks.

VentureBeat collaborator Sam Witteveen, co-founder of AI and ML training company Red Dragon AI, received early access to Gemini Embedding 2 and published a video of his impressions on YouTube.

Who needs and uses an embedding model?

For those who have encountered the term "embeddings" in AI discussions but find it abstract, a useful analogy is that of a universal library.

In a traditional library, books are organized by metadata: author, title, or genre. In the "embedding space" of an AI, information is organized by ideas.

Imagine a library where books aren't organized by the Dewey Decimal System, but by their "vibe" or "essence". In this library, a biography of Steve Jobs would physically fly across the room to sit next to a technical manual for a Macintosh. A poem about a sunset would drift toward a photography book of the Pacific Coast, with all thematically similar content organized in beautiful hovering "clouds" of books. This is basically what an embedding model does.

An embedding model takes complex data—like a sentence, a photo of a sunset, or a snippet of a podcast—and converts it into a long list of numbers called a vector.

These numbers represent coordinates in a high-dimensional map. If two items are "semantically" similar (e.g., a photo of a golden retriever and the text "man's best friend"), the model places their coordinates very close to each other in this map. Today, these models are the invisible engine behind:

Search Engines: Finding results based on what you mean, not just the specific words you typed.

Recommendation Systems: Netflix or Spotify suggesting content because its "coordinates" are near things you already like.

Enterprise AI: Large companies use them for Retrieval-Augmented Generation (RAG), where an AI assistant "looks up" a company's internal PDFs to answer an employee's question accurately.
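To make the "coordinates" idea concrete, here is a minimal sketch of how closeness between two embeddings is commonly measured, using cosine similarity. The four-dimensional vectors and their names are invented for illustration; real embeddings have thousands of dimensions and come from the model itself, and cosine similarity is one common choice of metric rather than something the announcement specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for this example (not real model output).
golden_retriever_photo = [0.9, 0.8, 0.1, 0.0]
mans_best_friend_text = [0.8, 0.9, 0.2, 0.1]
quarterly_tax_form = [0.0, 0.1, 0.9, 0.8]

related = cosine_similarity(golden_retriever_photo, mans_best_friend_text)
unrelated = cosine_similarity(golden_retriever_photo, quarterly_tax_form)

print(f"photo vs. phrase:   {related:.2f}")
print(f"photo vs. tax form: {unrelated:.2f}")
```

The photo and the phrase score far higher with each other than either does with the unrelated document, which is the property search, recommendation, and RAG systems rely on.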

The concept of mapping words to vectors dates back to the 1950s with linguists like John Rupert Firth, but the modern "vector revolution" began in the early 2000s when Yoshua Bengio’s team first used the term "word embeddings". The real breakthrough for the industry was Word2Vec, released by a team at Google led by Tomas Mikolov in 2013. Today, the market is led by a handful of major players:

OpenAI: Known for its widely-used text-embedding-3 series.

Google: With the new Gemini and previous Gecko models.

Anthropic and Cohere: Providing specialized models for enterprise search and developer workflows.

By moving beyond text to a natively multimodal architecture, Google is attempting to create a singular, unified map for the sum of human digital expression—text, images, video, audio, and documents—all residing in the same mathematical neighborhood.

Why Gemini Embedding 2 is such a big deal

Most leading models are still "text-first." If you want to search a video library, the AI usually has to transcribe the video into text first, then embed that text.

Google’s Gemini Embedding 2 is natively multimodal.

As Logan Kilpatrick of Google DeepMind posted on X, the model allows developers to "bring text, images, video, audio, and docs into the same embedding space".

The model processes audio as sound waves and video as motion directly, without needing to turn them into text first. This reduces "translation" errors and captures nuances that text alone might miss.

For developers and enterprises, the "natively multimodal" nature of Gemini Embedding 2 represents a shift toward more efficient AI pipelines.

By mapping all media into a single 3,072-dimensional space, developers no longer need separate systems for image search and text search; they can perform "cross-modal" retrieval—using a text query to find a specific moment in a video or an image that matches a specific sound.

And unlike its predecessors, Gemini Embedding 2 can process requests that mix modalities. A developer can send a request containing both an image of a vintage car and the text "What is the engine type?". The model doesn't process them separately; it treats them as a single, nuanced concept. This allows for a much deeper understanding of real-world data where the "meaning" is often found in the intersection of what we see and what we say.
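Cross-modal retrieval over a shared space can be sketched as a single index that holds items of every modality and is searched with one similarity function. In the example below, the item IDs, the three-dimensional vectors, and the `search` helper are all hypothetical stand-ins; in a real pipeline each vector would be produced by the embedding model, and the index would usually live in a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One index for all media types. Vectors are invented 3-dim placeholders.
index = [
    {"id": "keynote.mp4@00:12:30", "modality": "video", "vector": [0.9, 0.1, 0.2]},
    {"id": "engine-diagram.png", "modality": "image", "vector": [0.2, 0.9, 0.1]},
    {"id": "support-call.wav", "modality": "audio", "vector": [0.1, 0.2, 0.9]},
    {"id": "release-notes.txt", "modality": "text", "vector": [0.7, 0.3, 0.3]},
]


def search(query_vector, items, top_k=2):
    """Rank every item, whatever its modality, by similarity to the query."""
    scored = [(cosine_similarity(query_vector, item["vector"]), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Placeholder for the embedding of a text query such as "the on-stage product demo".
text_query_vector = [0.85, 0.15, 0.25]

results = search(text_query_vector, index)
for item in results:
    print(item["modality"], item["id"])
```

Because the text query and the video segment share one space, no separate image-search or video-search system is involved; the same ranking call serves all of them.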

One of the model's more technical features is Matryoshka Representation Learning. Named after Russian nesting dolls, this technique allows the model to "nest" the most important information in the first few numbers of the vector.

An enterprise can choose to use the full 3,072 dimensions for maximum precision, or "truncate" them down to 768 or 1,536 dimensions to save on database storage costs with minimal loss in accuracy.
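The storage trade-off is simple arithmetic, and the truncation itself is just a slice. The sketch below assumes vectors are stored as 32-bit floats and that a truncated vector is rescaled to unit length before use, which is a common practice with Matryoshka-style embeddings rather than a documented requirement of this model. The `truncate_embedding` and `storage_bytes` helpers and the placeholder vector are hypothetical.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values, then rescale the result to unit length."""
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]


def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values (4 bytes each)."""
    return num_vectors * dims * bytes_per_value


# Placeholder 3,072-dim vector, not real model output.
full = [1.0 / (i + 1) for i in range(3072)]
small = truncate_embedding(full, 768)

for dims in (3072, 1536, 768):
    gib = storage_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims: {gib:.2f} GiB per million vectors")
```

Under these assumptions, dropping from 3,072 to 768 dimensions cuts raw vector storage to a quarter, before any index overhead or compression the database applies.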

Benchmarking the performance gains of moving to multimodal

Gemini Embedding 2 establishes a new performance ceiling for multimodal depth, specifically outperforming previous industry leaders across text, image, and video evaluation tasks.

The model’s most significant lead is found in video and audio retrieval, where its native architecture allows it to bypass the performance degradation typically associated with text-based transcription pipelines.

Specifically, in video-to-text and text-to-video retrieval tasks, the model demonstrates a measurable performance gap over existing industry leaders, accurately mapping motion and temporal data into a unified semantic space.

The technical results show a distinct advantage in the following standardized categories:

Multimodal Retrieval: Gemini Embedding 2 consistently outperforms leading text and vision models in complex retrieval tasks that require understanding the relationship between visual elements and textual queries.

Speech and Audio Depth: The model introduces a new standard for native audio embeddings, achieving higher accuracy in capturing phonetic and tonal intent compared to models that rely on intermediate text transcription.

Contextual Scaling: In text-based benchmarks, the model maintains high precision while utilizing its expansive 8,192-token context window, ensuring that long-form documents are embedded with the same semantic density as shorter snippets.

Dimension Flexibility: Testing across the Matryoshka Representation Learning (MRL) layers reveals that even when truncated to 768 dimensions, the model retains a significant majority of its 3,072-dimension performance, outperforming fixed-dimension models of similar size.

What it means for enterprise databases

For the modern enterprise, information is often a fragmented mess. A single customer issue might involve a recorded support call (audio), a screenshot of an error (image), a PDF of a contract (document), and a series of emails (text).

In previous years, searching across these formats required four different pipelines. With Gemini Embedding 2, an enterprise can create a Unified Knowledge Base. This enables a more advanced form of RAG, wherein a company’s internal AI doesn't just look up facts, but understands the relationship between them regardless of format.

Early partners are already reporting drastic efficiency gains:

Sparkonomy, a creator economy platform, reported that the model’s native multimodality slashed their latency by up to 70%. By removing the need for intermediate LLM "inference" (the step where one model explains a video to anothe

Source: VentureBeat