Gemini Embedding 2 maps text, images, video, audio, and documents into a single, unified embedding space that captures semantic relationships across modalities.

Performance

State-of-the-art results on a range of cross-modal benchmarks

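Models compared, in column order:
  • Gemini Embedding 2
  • gemini-embedding-001 (legacy text-only Google model)
  • multimodalembedding@001 (legacy multimodal Google model)
  • Amazon Nova 2 Multimodal Embeddings
  • Voyage Multimodal 3.5

Metric type | Metric name | Scores (Gemini Embedding 2 first, then the other models in column order, where reported)
Text-Text | MTEB (Multilingual) Mean (Task) | 69.9, 68.4, 63.8**, 58.5***
Text-Text | MTEB (Code) Mean (Task) | 84.0, 76.0**
Text-Image | TextCaps recall@1 | 89.6, 74.0, 76.0, 79.4
Text-Image | Docci recall@1 | 93.4, 84.0, 83.8
Image-Text | TextCaps recall@1 | 97.4, 88.1, 88.9, 88.6
Image-Text | Docci recall@1 | 91.3, 76.5, 77.4
Text-Document | ViDoRe v2 ndcg@10 | 64.9, 28.9, 60.6, 65.5**
Text-Video | Vatex ndcg@10 | 68.8, 54.9, 60.3, 55.2
Text-Video | MSR-VTT ndcg@10 | 68.0, 57.9, 67.0, 63.0**
Text-Video | Youcook2 ndcg@10 | 52.5, 34.9, 34.7, 31.4**
Speech-Text | MSEB mrr@10 | 73.9, *
Speech-Text | MSEB (ASR)**** mrr@10 | 70.4, *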

* score not available
** self-reported
*** voyage-3.5
**** ASR model converts audio queries to text
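
For readers unfamiliar with the metric names in the table, here is a minimal sketch of how recall@k, mrr@k, and ndcg@k are commonly computed for a single query. It uses the standard textbook definitions with linear gain for nDCG. The exact variants used by each benchmark are not stated on this page, so treat the function names and the toy data as illustrative assumptions. Benchmarks average these per-query values over all queries, and the table appears to report them on a 0-100 scale.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(gains):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(item, 0.0) for item in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy query (hypothetical ids): the system returned d3, d1, d7; only d1 is relevant.
ranked = ["d3", "d1", "d7"]
print(recall_at_k(ranked, ["d1"], k=1))      # relevant item is not in first place
print(mrr_at_k(ranked, ["d1"], k=10))        # first relevant item sits at rank 2
print(ndcg_at_k(ranked, {"d1": 1.0}, k=10))  # penalized for rank 2 instead of rank 1
```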

Hands-on

Generate embeddings and explore how you can use them

A screenshot of the "Multimodal Search" web interface powered by Gemini Embedding 2, featuring three columns displaying text descriptions, images, and audio files.

Multimodal Search with Gemini Embedding 2

Surface the most relevant matches across modalities by calculating semantic similarity.
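
As a rough sketch of what "calculating semantic similarity" can look like in practice, the snippet below ranks items from different modalities against a query using cosine similarity, a common choice for embedding search. The page does not say which similarity measure the demo uses, and the file names and 3-dimensional vectors are made up for illustration. Real embeddings would come from the model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings for a text file, an image, and an audio clip.
# Because all modalities share one space, they can be compared directly.
corpus = {
    "caption.txt": [0.9, 0.1, 0.0],
    "photo.jpg": [0.8, 0.2, 0.1],
    "clip.wav": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

for name, score in rank_by_similarity(query, corpus):
    print(f"{name}: {score:.3f}")
```

For large collections, a vector index would usually replace this linear scan, but the scoring idea stays the same.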


Model information

Name: Embedding 2
Status: Generally available
Input:
  • Text
  • Image
  • Video
  • Audio
  • Documents
Output: Embeddings
Input tokens: 8,192
Dimension sizes: 128 - 3072
Availability:
  • Gemini API
  • Gemini Enterprise Agent Platform
Documentation:
  • Gemini API docs
  • Google Cloud docs
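
The dimension range listed above implies a storage trade-off when indexing many items. The sketch below estimates raw vector storage, assuming each component is stored as a 4-byte float32. The page does not specify a storage format, and 768 is only an illustrative value inside the listed range, so read the numbers as back-of-the-envelope estimates.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone; ignores index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value

# Smallest listed size, an illustrative middle value, and the largest listed size.
for dims in (128, 768, 3072):
    size_mib = index_size_bytes(1_000_000, dims) / (1024 ** 2)
    print(f"{dims:>4} dims: {size_mib:,.0f} MiB per million vectors")
```

Under these assumptions, the largest size needs 24 times the storage of the smallest, since storage grows linearly with dimension count.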