SAM 2 Explained: Promptable Segmentation Across Video

Quick answer

SAM 2 is Meta AI’s segmentation foundation model that tracks an object through a video after a single prompt, instead of re-segmenting every frame from scratch. It reports better video accuracy while needing 3x fewer interactions than prior video segmentation methods, and on still images it is more accurate and 6x faster than the original SAM. Two things make this work: a streaming memory module inside the model, and SA-V, the largest video segmentation dataset released to date.

From single images to a video stream

The first Segment Anything Model made image segmentation feel like a solved interaction: click a point, get a clean mask, do it for anything. Video broke that. Objects move, change scale, leave the frame, get occluded, and come back. Run an image model frame by frame and you get a stack of independent masks with no idea that frame 1’s dog is also frame 200’s dog. The user ends up re-clicking constantly.

SAM 2 reframes the task as promptable visual segmentation over a stream rather than a stack. You prompt an object once — a click, box, or mask on any frame — and the model propagates that identity forward and backward through the video. The interface stays the same as SAM; what changes is that the model now remembers.
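To make "forward and backward from any frame" concrete, here is a minimal sketch of the frame order a propagation pass could follow. The function name `propagation_order` and the choice to run the forward pass before the backward pass are assumptions for illustration, not SAM 2's documented scheduling.

```python
def propagation_order(num_frames, prompt_frame):
    """Return (forward, backward) frame indices, starting at the prompted frame.

    Hypothetical helper: it only illustrates that a prompt on any frame
    can seed tracking in both temporal directions.
    """
    if not 0 <= prompt_frame < num_frames:
        raise ValueError("prompt_frame must be a valid frame index")
    # Forward pass: the prompted frame, then every later frame in order.
    forward = list(range(prompt_frame, num_frames))
    # Backward pass: frames before the prompt, walking toward frame 0.
    backward = list(range(prompt_frame - 1, -1, -1))
    return forward, backward


forward, backward = propagation_order(num_frames=6, prompt_frame=2)
print(forward)   # [2, 3, 4, 5]
print(backward)  # [1, 0]
```

Both passes start next to the prompted frame, so each one begins with the most reliable mask available and moves outward from it.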

The streaming memory module

The architecture is deliberately plain: a simple transformer with a memory mechanism that processes video in streaming fashion, one frame at a time, which is what allows real-time use. As each frame arrives, the model encodes it, attends to a memory of past frames and the prompts it has already seen, and emits the mask for the current frame. Those predictions and features are written back into the memory bank, so the next frame inherits context instead of starting cold.
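The encode, attend, predict, write-back loop described above can be sketched with a toy stand-in. This is not SAM 2's architecture: frames are short lists of pixel intensities in [0, 1], "attention" is a softmax-style similarity vote over stored pixels, and the names `read_memory`, `segment_stream`, `sharpness`, and `max_memory` are all hypothetical.

```python
import math
from collections import deque


def read_memory(pixel, memory, sharpness=50.0):
    """Similarity-weighted vote over every (feature, label) pair in memory."""
    weighted_sum, total = 0.0, 0.0
    for features, mask in memory:
        for feature, label in zip(features, mask):
            # Memory pixels that look like the query pixel get more weight.
            weight = math.exp(-sharpness * (pixel - feature) ** 2)
            weighted_sum += weight * label
            total += weight
    return weighted_sum / total if total > 0 else 0.0


def segment_stream(frames, prompt_mask, max_memory=4):
    """Segment frames one at a time, conditioning each on a rolling memory."""
    memory = deque(maxlen=max_memory)        # oldest entries are dropped first
    memory.append((frames[0], prompt_mask))  # the prompt seeds the memory
    masks = [prompt_mask]
    for frame in frames[1:]:
        # Read: each pixel asks memory whether it belongs to the object.
        mask = [1 if read_memory(p, memory) > 0.5 else 0 for p in frame]
        masks.append(mask)
        # Write back: this prediction becomes context for the next frame.
        memory.append((frame, mask))
    return masks


# A bright "object" (0.9) moves one pixel to the right per frame.
frames = [
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]
masks = segment_stream(frames, prompt_mask=[0, 1, 0, 0])
print(masks)  # [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Only the first frame carries a prompt, yet the mask follows the object, because every later frame is segmented against what memory already holds. In this toy the prompt itself is eventually evicted once more than `max_memory` frames have passed, which is a limitation of the sketch rather than a claim about the real model.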

The point worth stressing: the contribution is not a clever loss or an exotic backbone; it is treating memory as a first-class input to a segmentation model. That is what lets one prompt survive an occlusion. When the object disappears and reappears, memory is the thing that says “this is still the object you meant.” Frame-by-frame SAM has no such state, so it cannot do that.

The SA-V dataset

A streaming model needs streaming supervision, and that did not exist at scale. So Meta built a data engine: annotators use SAM 2 itself to label video, the labels train a better SAM 2, and the better model makes the next round of labeling faster. This loop produced SA-V, described as the largest video segmentation dataset to date, covering objects and parts across diverse scenes. The dataset is a deliverable in its own right. For video segmentation it may prove the more durable contribution than the model weights, because anyone can train on it.

Key results

Better video accuracy with 3x fewer interactions than prior video segmentation approaches. This is the headline efficiency claim, and the one users feel directly as fewer corrective clicks.
On image segmentation, SAM 2 is more accurate and 6x faster than the original SAM, so it is not a video-only specialist; it improves on its predecessor at the image task too.
Meta released the main model, the SA-V dataset, training code, and an interactive demo, which is enough for the results to be reproduced and adapted rather than just cited.

The combination is what matters: “track anything” systems often buy accuracy by spending more interactions or more compute time. SAM 2 reports gains on both axes at once, relative to its own predecessor and to prior video methods.

Limits and open questions

Memory policy is unsolved. Streaming memory raises the question of what to keep and what to forget; long videos, scene cuts, and crowds of similar objects can still confuse which memory belongs to which target.
Hard occlusion and re-identification. Object permanence works in many cases, but a long disappearance, a near-identical distractor, or an abrupt cut can still drop the track and force a re-prompt.
Interaction is not free. The 3x-fewer-clicks claim still assumes a human willing to correct; for fully unattended labeling, residual errors compound across a long clip.
Masks, not understanding. SAM 2 segments and tracks; it does not name objects, reason about actions, or model physical relations. Whether promptable video perception extends from masks to richer object state is the open frontier.
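To make the memory-policy question from the first item above concrete, here is a minimal sketch of one simple policy: keep prompted frames in their own small queue and let unprompted frames roll through a separate first-in, first-out queue. The class name `MemoryBank`, its methods, and the default sizes are hypothetical choices for illustration, not values taken from SAM 2.

```python
from collections import deque


class MemoryBank:
    """Toy policy: a short queue for prompted frames, a FIFO for recent ones."""

    def __init__(self, max_recent=6, max_prompted=2):
        # Separate queues, so recent frames can never push out a user prompt.
        self.prompted = deque(maxlen=max_prompted)
        self.recent = deque(maxlen=max_recent)

    def add(self, frame_idx, features, prompted=False):
        """Store one frame's features; the oldest entry in that queue is dropped."""
        queue = self.prompted if prompted else self.recent
        queue.append((frame_idx, features))

    def frame_indices(self):
        """Frames currently available for the next frame to attend to."""
        return [i for i, _ in self.prompted] + [i for i, _ in self.recent]


bank = MemoryBank(max_recent=3, max_prompted=1)
bank.add(0, "features-0", prompted=True)  # the user's click on frame 0
for idx in range(1, 10):
    bank.add(idx, f"features-{idx}")      # unprompted frames stream past
print(bank.frame_indices())  # [0, 7, 8, 9]
```

The trade-off is visible in the output: the prompt survives, but frames 1 through 6 are gone. If the object was last seen clearly in one of those frames and has been occluded since, this policy has already discarded the evidence needed to re-identify it.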

FAQ

What is SAM 2 and how is it different from SAM?

SAM 2 is Meta AI’s promptable segmentation foundation model for both images and video. The original SAM segmented single images; SAM 2 adds a streaming memory module so one prompt can track an object across video frames, and it is also more accurate and 6x faster than SAM on images.

How does SAM 2 track an object through a video?

You prompt the object once on any frame, and SAM 2’s memory module carries that identity forward and backward through the stream. Each new frame attends to a memory bank of past frames and prompts, so the model knows it is still segmenting the same object rather than starting over each frame.

What is the SA-V dataset?

SA-V is the video segmentation dataset Meta built alongside SAM 2, described as the largest of its kind to date. It was created with a data engine where annotators use SAM 2 to label video and those labels train a stronger model, in a loop.

Is SAM 2 open and can I use it?

Yes — Meta released the main SAM 2 model, the SA-V dataset, model training code, and an interactive demo, which is what makes it practical to fine-tune or build on rather than only read about.

When does SAM 2 still fail?

SAM 2 can lose a track under long occlusions, scene cuts, or when several near-identical objects appear together, and it still needs occasional corrective prompts. It also only produces masks — it does not label or reason about what the objects are doing.

SAM 2’s real contribution is simple to state: it gives segmentation a memory, so one click can follow an object through time. For the full details, read the original paper, “SAM 2: Segment Anything in Images and Videos,” on arXiv.