Embeddings model choices

Published on March 11, 2026

Running an experiment to see if gemini-embeddings-2, which is a multimodal embeddings model, is a viable substitute for OpenAI's text-embedding-3-small.

The Gemini model accepts a specification of the embedding's purpose: retrieval (query or document), or semantic similarity. Would be interesting to understand exactly how this modifies the vector.

Embedding text fragments in retrieval mode gives me specific, rigid 'suggestions' (i.e., related/proximal fragments); these turn out to be less interesting than the OpenAI ones. The latter seem to have captured the gestalt, or vibe, better.

A possible challenge is that I need these embeddings to be both retrieval-optimised (for search) and semantic-similarity-optimised (for traversal, threading, and more creative tasks).
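One way around this might be to keep two vectors per fragment and pick the one that matches the task. A minimal sketch, assuming a hypothetical `embed(text, task)` callable that wraps whichever provider is in use; the task labels, the `Fragment` class, and the function names are placeholders of mine, not any SDK's API:

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical signature: embed(text, task) -> vector. Not a real SDK call.
Embedder = Callable[[str, str], List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Fragment:
    id: str
    text: str
    # One vector per purpose, keyed by an illustrative task label.
    vectors: Dict[str, List[float]] = field(default_factory=dict)


def index_fragment(frag: Fragment, embed: Embedder) -> None:
    """Embed the same text twice, once per purpose."""
    frag.vectors["retrieval_document"] = embed(frag.text, "retrieval_document")
    frag.vectors["semantic_similarity"] = embed(frag.text, "semantic_similarity")


def search(query: str, frags: List[Fragment], embed: Embedder, k: int = 5) -> List[str]:
    """Search: compare a query vector against the retrieval vectors."""
    q = embed(query, "retrieval_query")
    ranked = sorted(
        frags,
        key=lambda f: cosine(q, f.vectors["retrieval_document"]),
        reverse=True,
    )
    return [f.id for f in ranked[:k]]


def related(frag: Fragment, frags: List[Fragment], k: int = 5) -> List[str]:
    """Traversal: compare similarity vectors, skipping the fragment itself."""
    own = frag.vectors["semantic_similarity"]
    others = [f for f in frags if f.id != frag.id]
    ranked = sorted(
        others,
        key=lambda f: cosine(own, f.vectors["semantic_similarity"]),
        reverse=True,
    )
    return [f.id for f in ranked[:k]]
```

The cost is double the embedding calls and storage per fragment, so this only makes sense if the two modes really do behave differently enough to matter.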

Testing different categories of fragment

Literary (The Tartar Steppe)
- OpenAI related two fragments on the shortness of life that Gemini did not
- Gemini had fewer intra-artefact suggestions; it related across texts more
- Gemini had more literary relations: more from Calvino, Adams, Dostoyevsky
Non-fiction Essay (Sam Kriss on Sudan)
- For a fragment about the ethnic makeups of the different sides of the conflict, OpenAI suggested multiple fragments about various Turkic ethnicities. Gemini kept it about Sudan, and conflict in Africa.
- Gemini also had some obviously better results on herders in Sudan; it related this to herder fragments in Chaffetz, but not recklessly.
A fragment on Velcheru Narayana Rao
- Gemini suggested fragments from a text by Rao; OpenAI didn't. This seems conclusive!
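These comparisons were made by eye, but two of them could be put into numbers: how much the two models' suggestion lists overlap, and what share of suggestions come from the same artefact as the query fragment. A rough sketch, assuming each model's vectors sit in a plain dict keyed by fragment id; the function names and the `source_of` mapping are hypothetical:

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_id: str, vectors: Dict[str, List[float]], k: int = 5) -> List[str]:
    """Ids of the k fragments closest to query_id, excluding itself."""
    q = vectors[query_id]
    scored = [(cosine(q, v), fid) for fid, v in vectors.items() if fid != query_id]
    scored.sort(reverse=True)
    return [fid for _, fid in scored[:k]]


def overlap(a: List[str], b: List[str]) -> float:
    """Jaccard overlap between two suggestion lists (1.0 means identical sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def intra_source_share(query_id: str, neighbours: List[str], source_of: Dict[str, str]) -> float:
    """Fraction of suggestions drawn from the same artefact as the query."""
    if not neighbours:
        return 0.0
    same = sum(1 for n in neighbours if source_of[n] == source_of[query_id])
    return same / len(neighbours)
```

A low overlap with a low intra-source share for one model would match the "related across texts more" impression above; it says nothing about which suggestions are more interesting, which still needs reading them.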

Thread building

A thread on 'language as prescriptive'
- Gemini's results were narrower and deeper: a lot more from the same source/domain
  - Dropping the thresholds to standardise them helped with this, but the Gemini thread was still narrower
  - The thread was a 'theme' thread though, so this might even be desirable
An exploratory thread on the value of style in literature and art
- OpenAI's was more 'exploratory' and interesting, and made more cross-domain connections
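On the thresholds: raw cosine scores from two different models are not on the same scale, so a single fixed cutoff will admit different amounts from each. One possible way to standardise (not what I did above, which was simply lowering the thresholds) is to set each model's cutoff so that the same fraction of candidates passes. A sketch, with a hypothetical function name and made-up scores:

```python
import math
from typing import List


def threshold_for_fraction(scores: List[float], keep_fraction: float) -> float:
    """Return the cutoff at which the top keep_fraction of scores pass (score >= cutoff)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Always keep at least one candidate.
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]


# Illustrative scores only; not measurements from either model.
model_a_scores = [0.62, 0.55, 0.41, 0.30]
model_b_scores = [0.91, 0.88, 0.80, 0.72]

cutoff_a = threshold_for_fraction(model_a_scores, 0.5)
cutoff_b = threshold_for_fraction(model_b_scores, 0.5)
```

With cutoffs chosen this way, any remaining difference in how narrow a thread is comes from which fragments each model ranks highly, not from one model simply scoring everything higher.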

Strange situation here: whatever helps with search precision works against Pond's value as a discovery tool.

Have just come to the conclusion that I don't need a multimodal model. Writing a text description of an image, which can then be embedded with the rest, is the ideal approach: Pond is concerned more with the conceptual content of an image, which can be captured in text.