Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.
Quick answer
Flamingo is a visual language model that learns new image and video tasks from a few prompt examples instead of task-specific fine-tuning. A single Flamingo model sets a new state of the art on several benchmarks with few-shot prompting, and on numerous tasks it beats prior models that were fine-tuned on thousands of times more labeled data. The trick is to keep a strong vision encoder and a strong language model frozen, and train only the layers that connect them.
Bridging frozen vision and language models
The core design choice is what Flamingo does not train. Both the vision encoder and the large language model are pretrained separately and then frozen — their weights never move during Flamingo training. What gets trained is the connective tissue between them.
Two new components carry the load. A Perceiver Resampler takes the variable-size grid of features from the vision encoder and compresses it into a small, fixed number of visual tokens, so a high-resolution image and a multi-frame video clip both reduce to the same manageable, consistent representation. Then gated cross-attention layers are interleaved into the frozen language model: at these layers the text tokens attend to the resampled visual tokens. Each inserted layer's output is scaled by a tanh gate whose learnable scalar is initialized to zero, so at initialization the new layers contribute nothing and Flamingo behaves exactly like the original language model. The visual pathway is then eased in during training rather than disrupting the backbone.
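The two components described above can be sketched in a few lines of NumPy. This is a minimal single-head, single-layer illustration under stated assumptions, not the paper's implementation: the width `d`, the 8 latent tokens, the 196-patch grid, and every function name are illustrative, and the feed-forward layers, multiple heads, and stacked layers of the real blocks are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head attention: each query row attends over all context rows."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
d = 16  # shared feature width (toy value, an assumption)

def init_weight():
    return rng.normal(scale=0.1, size=(d, d))

# Resampler: a fixed set of learned latent queries attends to however many
# visual features the encoder produced, so the output size never changes.
latents = rng.normal(size=(8, d))  # 8 visual tokens (toy value)
resampler_w = (init_weight(), init_weight(), init_weight())

def resample(visual_features):
    return latents + cross_attention(latents, visual_features, *resampler_w)

# Gated cross-attention: text attends to the visual tokens, and the result is
# scaled by tanh(alpha) before being added back to the text stream.
xattn_w = (init_weight(), init_weight(), init_weight())

def gated_cross_attention(text_hidden, visual_tokens, alpha):
    update = cross_attention(text_hidden, visual_tokens, *xattn_w)
    return text_hidden + np.tanh(alpha) * update

image_tokens = resample(rng.normal(size=(196, d)))      # one image, 196 patches
video_tokens = resample(rng.normal(size=(8 * 196, d)))  # 8 frames, flattened
text_hidden = rng.normal(size=(5, d))

at_init = gated_cross_attention(text_hidden, image_tokens, alpha=0.0)
trained = gated_cross_attention(text_hidden, image_tokens, alpha=0.5)
print(image_tokens.shape, video_tokens.shape)  # (8, 16) (8, 16)
print(np.allclose(at_init, text_hidden))       # True: a zero gate leaves the text untouched
```

The resampler returns the same shape for a single image and for an 8-frame clip, and with `alpha=0.0` the gated layer hands back the text states unchanged, which is the property that lets training start from the frozen language model's behavior.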
This is the part worth dwelling on, because much later work pulls the same lever: freezing two expensive pretrained models and training only a bridge is far cheaper than training a multimodal model end to end, and it inherits the language model’s text fluency for free.
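A toy update step makes the freezing argument concrete. This is a sketch under stated assumptions: the group names, the parameter counts, and the `sgd_step` helper are hypothetical placeholders, not Flamingo's real sizes or training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameter groups; the sizes are placeholders, not Flamingo's real counts.
params = {
    "vision_encoder": rng.normal(size=400),
    "language_model": rng.normal(size=7000),
    "bridge": rng.normal(size=600),  # stands in for resampler + gated cross-attention
}
trainable = {"bridge"}

def sgd_step(params, grads, lr=0.1):
    """Apply a gradient step to trainable groups only; frozen groups pass through unchanged."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }

grads = {name: np.ones_like(value) for name, value in params.items()}
updated = sgd_step(params, grads)

total = sum(value.size for value in params.values())
learned = sum(params[name].size for name in trainable)
print(f"trainable fraction: {learned / total:.1%}")  # 7.5% with these placeholder sizes
```

Only the bridge moves in this setup. In a real framework the frozen language model typically still takes part in the forward and backward pass so that gradients can reach the bridge layers, which means the savings come mainly from skipping weight gradients and optimizer state for the frozen groups.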
Few-shot learning with interleaved prompts
Flamingo’s input is a single sequence of arbitrarily interleaved images, video frames, and text. That format is what enables in-context learning: you can lay several image-caption pairs into the prompt, each image marked by an <image> placeholder, then append a new image, and the model continues the pattern. This is how a text LLM does few-shot prompting, but now with pixels in the sequence.
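The prompt layout can be sketched as plain string assembly. This is an illustrative sketch, not Flamingo's tokenizer or API: the `<image>` and `<EOC>` marker spellings are modeled on the paper's description, while `Shot`, `build_prompt`, the `Output:` prefix, and the file names are hypothetical.

```python
from dataclasses import dataclass

IMAGE_TAG = "<image>"   # marks where a visual input sits (assumed spelling)
END_OF_CHUNK = "<EOC>"  # closes each example's text (assumed spelling)

@dataclass
class Shot:
    image_path: str  # hypothetical file path
    text: str

def build_prompt(shots, query_image, prefix="Output:"):
    """Return (prompt_text, image_paths), with images in the order their tags appear."""
    parts, images = [], []
    for shot in shots:
        parts.append(f"{IMAGE_TAG}{prefix} {shot.text}{END_OF_CHUNK}")
        images.append(shot.image_path)
    # The query image ends the sequence with an open slot for the model to fill.
    parts.append(f"{IMAGE_TAG}{prefix}")
    images.append(query_image)
    return "".join(parts), images

shots = [
    Shot("chinchilla.jpg", "A chinchilla sitting on a rock."),
    Shot("shiba.jpg", "A shiba inu lying in the grass."),
]
prompt, images = build_prompt(shots, "flamingo.jpg")
print(prompt)
print(images)
```

Switching tasks means swapping the shots and the prefix (for example a question and answer pair per image); nothing about the model changes.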
This matters because the same model handles open-ended tasks (visual question answering where it must generate an answer, captioning where it must describe a scene or event) and close-ended tasks (multiple-choice VQA) without any architecture change; only the examples in the prompt change. The interleaved format also shapes the training data: Flamingo’s training mixture includes large-scale web corpora of naturally interleaved images and text, alongside paired image-text and video-text data. The interleaved portion is the signal that teaches it to read a mixed image-text context rather than just caption one isolated image.
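The open-ended and close-ended cases differ only in how the output is read off. Open-ended tasks decode text freely; for a close-ended task, one common approach is to score each candidate answer by its likelihood under the model and keep the best. The sketch below shows that selection step with a stand-in scorer: `pick_answer`, `toy_log_likelihood`, and the scores are all hypothetical, and a real scorer would sum the model's token log-probabilities for each candidate.

```python
def pick_answer(score_fn, prompt, candidates):
    """Close-ended selection: score every candidate continuation and return the best one."""
    scores = {candidate: score_fn(prompt, candidate) for candidate in candidates}
    return max(scores, key=scores.get), scores

# Stand-in scorer so the sketch runs. A real one would sum the model's token
# log-probabilities for `candidate` given the images and `prompt`.
def toy_log_likelihood(prompt, candidate):
    fake_scores = {"a flamingo": -1.2, "a heron": -3.5, "a stork": -4.1}
    return fake_scores[candidate]

prompt = "<image>Question: What bird is this? Answer:"
best, scores = pick_answer(
    toy_log_likelihood, prompt, ["a heron", "a flamingo", "a stork"]
)
print(best)  # a flamingo
```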
Key results
A single Flamingo model reaches a new state of the art across a spectrum of image and video benchmarks using few-shot prompting alone, with no per-task weight updates. On numerous benchmarks it outperforms models that were fine-tuned on thousands of times more task-specific annotated data — the headline claim, and the one that made the architecture influential.
The evaluation deliberately spans the task spectrum: open-ended generation (VQA, captioning of images and video) and close-ended selection (multiple-choice VQA). The point of testing across that range is that one set of frozen weights, steered only by in-context examples, covers all of it. The practical reading: Flamingo traded the cost of collecting and labeling large task-specific datasets for the cost of writing a good prompt with a few examples.
Limits and open questions
Flamingo is a 2022 result, and several caveats are structural, not incidental. Few-shot prompting is sensitive to which examples you pick and how they are ordered, so a reported few-shot number describes one particular setup rather than a guarantee. Because both backbones are frozen, Flamingo inherits whatever biases, blind spots, and stale knowledge live in its vision and language pretraining; the bridge cannot fix what the frozen models get wrong. The web-scraped interleaved training data carries its own bias and toxicity baggage. And like other open-ended VLMs, it can hallucinate confident but wrong descriptions, a failure mode that is a serious obstacle to deployment. Grounding fidelity, hallucination control, and inference cost for a model this size remain the open problems the paper does not close.
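One way to make the sensitivity to example choice and order visible is to repeat an evaluation over several randomly drawn, randomly ordered support sets and report the spread alongside the mean. The sketch below assumes a hypothetical `evaluate` callable that returns an accuracy for a given list of shots; `toy_evaluate` is a stand-in, not a real benchmark.

```python
import random
import statistics

def few_shot_spread(evaluate, support_pool, num_shots, num_trials, seed=0):
    """Run `evaluate` on several random, randomly ordered support sets; return (mean, stdev)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_trials):
        shots = rng.sample(support_pool, num_shots)  # new examples in a new order each trial
        scores.append(evaluate(shots))
    return statistics.mean(scores), statistics.stdev(scores)

# Stand-in evaluator so the sketch runs: it pretends accuracy depends on
# which example happens to come first in the prompt.
def toy_evaluate(shots):
    return 0.50 + 0.01 * shots[0]

mean, spread = few_shot_spread(
    toy_evaluate, support_pool=list(range(20)), num_shots=4, num_trials=10
)
print(f"accuracy {mean:.3f} +/- {spread:.3f}")
```

A single headline number hides this spread; reporting both gives a reader some sense of how much the prompt, rather than the model, is driving the result.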
FAQ
What is Flamingo and what does it do?
Flamingo is a family of visual language models from DeepMind that take interleaved images, video, and text as input and generate text. It learns new multimodal tasks — captioning, visual question answering, multiple-choice VQA — from a few examples placed in the prompt, rather than from task-specific fine-tuning.
How does Flamingo connect a vision encoder to a language model?
Both the vision encoder and the language model are pretrained and kept frozen. Flamingo trains two new pieces between them: a Perceiver Resampler that turns variable visual features into a fixed set of visual tokens, and gated cross-attention layers inserted into the language model so text tokens can attend to those visual tokens.
Why is Flamingo important for modern multimodal AI?
Flamingo popularized a recipe that many later visual language models adopted: keep strong frozen backbones, train only a lightweight cross-modal bridge, and use the prompt as the task interface. It showed that few-shot in-context learning, already familiar from text LLMs, works when images and video are interleaved into the sequence.
Does Flamingo need fine-tuning for each new task?
No. The headline result is that a single Flamingo model, with frozen weights, reaches state-of-the-art on several benchmarks using only few-shot prompting, and on numerous tasks beats models fine-tuned on thousands of times more labeled data. You change the prompt examples, not the weights.
One line: Flamingo’s lasting idea is that a small trainable bridge between two frozen giants turns multimodal tasks into a prompting problem. Read the original paper at https://arxiv.org/abs/2204.14198.