Gemini 1.5: Near-Perfect Recall Across Millions of Tokens

Gemini 1.5 Pro and Flash keep >99% retrieval recall up to at least 10M tokens of text, video, and audio — and Pro matches Gemini 1.0 Ultra with far less compute.

Quick answer

Gemini 1.5 keeps near-perfect retrieval, above 99%, up to at least 10 million tokens of context, spanning long documents, hours of video, and hours of audio. Against the contemporaries the report cites, that is 50x the 200k-token window of Claude 3.0 and about 78x the 128k-token window of GPT-4 Turbo. The catch worth saying out loud: clean needle-in-a-haystack recall is not the same as deep reasoning over that whole haystack.

What a 10M-token context unlocks

Most models force you to chop the world into small windows, so legal files, codebases, long meetings, and multi-hour video all get pre-chunked, retrieved, and summarized before the model ever sees them. Critical detail buried far from the prompt gets dropped at the wrong cut point. Gemini 1.5 attacks the window itself. Instead of asking you to build a retrieval pipeline first, it ingests the whole working set — text, audio, and video together — and is measured on whether it can pull one fine-grained fact out of millions of tokens. The report finds next-token prediction keeps improving and retrieval stays above 99% all the way to 10M tokens, which is what makes “read the entire repository” or “watch the whole film” plausible as a single prompt rather than a pipeline.
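The measurement described above can be sketched as a needle-in-a-haystack test. This is a minimal illustration, not the report's evaluation harness: `build_haystack`, `recall_at_depths`, the `ask` callable, and `toy_model` are hypothetical names, and the filler text and the "secret number" are made up.

```python
from typing import Callable, Sequence

FILLER = "The grass is green and the sky is blue."   # synthetic padding sentence
NEEDLE = "The secret number for the audit is 7481."  # made-up fact to retrieve
QUESTION = "What is the secret number for the audit?"
ANSWER = "7481"


def build_haystack(n_sentences: int, depth: float) -> str:
    """Return n_sentences of filler with NEEDLE inserted at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def recall_at_depths(
    ask: Callable[[str], str],
    n_sentences: int,
    depths: Sequence[float],
) -> float:
    """Fraction of needle positions at which the reply contains ANSWER."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(n_sentences, depth) + "\n\n" + QUESTION
        if ANSWER in ask(prompt):
            hits += 1
    return hits / len(depths)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: it only 'reads' the last 2,000 characters."""
    window = prompt[-2000:]
    return ANSWER if NEEDLE in window else "I don't know."
```

A real run would replace `toy_model` with a call to an actual model and sweep both haystack length and needle depth. The toy model finds the needle only when it sits near the end of the prompt, which is the failure pattern this kind of test is designed to expose. Note that the score still only measures locating one fact, not reasoning over the whole context.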

Mixture-of-experts efficiency

The headline is not just a bigger number. Gemini 1.5 ships as two compute-efficient multimodal models: an updated 1.5 Pro aimed at capability, which the report describes as a sparse mixture-of-experts Transformer, and 1.5 Flash, a lighter variant built for efficiency with minimal quality regression. The result that matters for cost: 1.5 Pro matches or surpasses Gemini 1.0 Ultra, the previous flagship, across a broad benchmark set while using significantly less compute. So the long-context leap did not come by paying for a much larger dense model; it came alongside better efficiency, which is what makes million-token serving economically sane.
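One way to see how a mixture-of-experts model can add capacity without a matching rise in per-input compute is top-k expert routing. The sketch below is a generic, textbook-style router in plain Python, offered for intuition only. It is not the Gemini 1.5 architecture, whose routing details are not covered here, and `route_top_k`, `moe_forward`, and the toy experts are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


def moe_forward(
    x: float,
    experts: Sequence[Expert],
    gate_logits: Sequence[float],
    k: int,
) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


# Toy experts standing in for separate feed-forward blocks (illustrative only).
TOY_EXPERTS: List[Expert] = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: -x,
    lambda x: x * x,
]
```

With four experts and `k=2`, each input touches only two expert functions even though all four contribute parameters, which is the basic reason sparse models can be cheaper to run than a dense model of the same total size.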

Key results

  • Retrieval: >99% recall on long-context retrieval up to at least 10M tokens, across text, video, and audio modalities.
  • State of the art: improves long-document QA, long-video QA, and long-context automatic speech recognition (ASR).
  • Capability vs. cost: 1.5 Pro matches or surpasses Gemini 1.0 Ultra across a broad benchmark suite while being more compute-efficient.
  • Real workflows: in a study across 10 job categories, professionals collaborating with the model reported 26% to 75% time savings.
  • In-context language learning: given a grammar manual for Kalamang — a language with fewer than 200 speakers and almost no training data online — the model learns to translate English to Kalamang at a level similar to a person who studied the same manual.

Why the Kalamang result is the interesting one

Benchmark wins age fast. The Kalamang demonstration is the durable signal here: the model was not pre-trained on a low-resource language, yet it picked up usable translation from a single grammar book placed in context. That reframes long context as in-context learning at scale — the window stops being a place to dump documents and becomes a place to teach the model a new skill at inference time, without fine-tuning.
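Mechanically, teaching a skill at inference time means placing the reference material ahead of the task in a single prompt. Here is a minimal sketch, assuming the manual is available as plain text. `build_translation_prompt`, the section markers, and the sample strings are hypothetical and are not the report's prompt format.

```python
def build_translation_prompt(
    grammar_manual: str,
    sentence: str,
    source_lang: str = "English",
    target_lang: str = "Kalamang",
) -> str:
    """Assemble one long prompt: the full reference material first, the task last."""
    if not grammar_manual.strip():
        raise ValueError("grammar_manual is empty; there is nothing to learn from")
    parts = [
        f"You are given a grammar manual for {target_lang}.",
        "=== GRAMMAR MANUAL START ===",
        grammar_manual.strip(),
        "=== GRAMMAR MANUAL END ===",
        f"Using only the manual above, translate this {source_lang} "
        f"sentence into {target_lang}.",
        f"{source_lang}: {sentence.strip()}",
        f"{target_lang}:",
    ]
    return "\n\n".join(parts)


# Placeholder manual text; a real run would load the full book here.
example_prompt = build_translation_prompt(
    "Chapter 1. Placeholder text standing in for the manual.",
    "The dog sleeps.",
)
```

No weights change in this setup. The entire manual travels with every request, which is why the approach depends on a long context window and why its cost per call grows with the size of the manual.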

Limits and open questions

Huge context is not free, and it is not the same as understanding. Long prompts are expensive and latency stays real, so a smaller retrieval system plus a short-context model is often cheaper and more controllable. Near-perfect needle-in-a-haystack recall also overstates reasoning: locating one fact across 10M tokens is easier than synthesizing scattered evidence across all of it, and the retrieval benchmarks do not measure the latter. The 26–75% time-savings figure comes from a self-reported professional study, not a controlled trial, so treat it as directional. The honest read: Gemini 1.5 did not kill RAG — it moved the boundary between model context, external memory, and retrieval.
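The cost side of that trade-off is simple arithmetic. The sketch below compares input-token cost for sending the whole corpus against sending a few retrieved chunks. The price per million tokens, the chunk size, and the chunk count are placeholder assumptions, not published pricing, and `input_cost` is a hypothetical helper.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at a flat per-million-token price."""
    if tokens < 0 or price_per_million < 0:
        raise ValueError("tokens and price must be non-negative")
    return tokens / 1_000_000 * price_per_million


PRICE = 1.00                 # assumed placeholder: currency units per 1M input tokens
CORPUS_TOKENS = 10_000_000   # whole working set placed in context
TOP_K_CHUNKS = 20            # assumed number of retrieved chunks
CHUNK_TOKENS = 1_000         # assumed chunk size in tokens

full_context = input_cost(CORPUS_TOKENS, PRICE)
retrieved = input_cost(TOP_K_CHUNKS * CHUNK_TOKENS, PRICE)

# 10M tokens vs. 20k tokens: the gap is the token ratio at any flat price.
ratio = full_context / retrieved
```

Under flat per-token pricing the ratio depends only on how many tokens each approach sends, not on the price itself. The sketch ignores output tokens, any prompt caching, the cost of running the retrieval system, and the quality risk of retrieval missing the one chunk that mattered.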

FAQ

How long is Gemini 1.5’s context window?

Gemini 1.5 maintains above-99% retrieval recall up to at least 10 million tokens in the report’s controlled long-context study, covering long documents, hours of video, and hours of audio.

Is Gemini 1.5 Pro better than Gemini 1.0 Ultra?

Gemini 1.5 Pro matches or surpasses Gemini 1.0 Ultra across a broad set of benchmarks while being more compute-efficient, so it delivers the prior flagship’s quality at lower serving cost.

What is the difference between Gemini 1.5 Pro and Flash?

Gemini 1.5 Pro is the capability-focused model; Gemini 1.5 Flash is a lighter, more efficient variant designed for cheaper serving with minimal quality regression.

Can Gemini 1.5 learn a new language in-context?

Yes — given only a grammar manual for Kalamang, a language with fewer than 200 speakers, Gemini 1.5 learns to translate English to Kalamang at roughly the level of a person who studied the same manual, without any task-specific fine-tuning.

Does Gemini 1.5 make RAG obsolete?

No. A 10M-token window shifts the boundary between retrieval and context, but long prompts stay expensive and latency-sensitive, so retrieval pipelines remain cheaper and more controllable for many tasks.

Gemini 1.5 turned the context window from a message box into a workspace, and showed that a model could pick up a new language from one book placed inside it. Read the report: https://arxiv.org/abs/2403.05530