Segment Anything (SAM): One Promptable Model, a Billion Masks
Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.
Quick answer
Segment Anything (SAM) is a single model that segments objects from a prompt: a click, a box, or a coarse mask. It was trained on SA-1B, which contains over 1 billion masks on 11M licensed images and was by far the largest segmentation dataset at its release. That scale is what lets one model transfer zero-shot to new image distributions and tasks, where the paper reports results often competitive with, or even better than, prior fully supervised systems.
Promptable segmentation as the task
The reframing is the real idea. Before SAM, a segmentation model was trained for a fixed set of categories or one annotation style; ask it for something outside that and it fails. SAM borrows the prompting recipe from foundation language models and applies it to pixels: instead of “label every pixel as one of N classes,” the task is “given any prompt, return a valid mask.” A point, a bounding box, or a rough mask all count as prompts, and when a prompt is ambiguous (a click on a shirt could mean the shirt, the person, or the whole crowd) the model returns several plausible masks rather than guessing one. That ambiguity-awareness is what makes it usable as an interactive tool, not just a benchmark number.
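The interface described above can be sketched as a handful of types. This is a minimal illustration, not SAM's actual API: the class names (`PointPrompt`, `BoxPrompt`, `MaskPrompt`, `Candidate`), the helper `pick_mask`, and the example scores are all hypothetical. It shows the two ideas from the paragraph: several prompt kinds feed one task, and an ambiguous prompt yields several scored masks rather than one forced answer.

```python
from dataclasses import dataclass
from typing import List, Union

Mask = List[List[int]]  # binary mask as rows of 0/1


@dataclass(frozen=True)
class PointPrompt:
    x: int
    y: int
    is_foreground: bool = True  # a click on the object vs. on the background


@dataclass(frozen=True)
class BoxPrompt:
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class MaskPrompt:
    mask: Mask  # a coarse mask to refine


# Any of the three counts as a valid prompt for the same task.
Prompt = Union[PointPrompt, BoxPrompt, MaskPrompt]


@dataclass(frozen=True)
class Candidate:
    mask: Mask
    score: float  # the model's own confidence estimate for this mask


def pick_mask(candidates: List[Candidate]) -> Candidate:
    """Return the highest-scoring candidate; a UI could instead show all of them."""
    if not candidates:
        raise ValueError("expected at least one candidate mask")
    return max(candidates, key=lambda c: c.score)


# Made-up output for one ambiguous click: shirt, person, whole group.
click = PointPrompt(x=1, y=0)
candidates = [
    Candidate(mask=[[0, 1], [0, 0]], score=0.91),  # shirt
    Candidate(mask=[[0, 1], [0, 1]], score=0.88),  # person
    Candidate(mask=[[1, 1], [1, 1]], score=0.62),  # group
]
best = pick_mask(candidates)
```

Picking the top score is only one policy. An interactive tool can just as well present every candidate and let the user choose or add another click.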
Architecturally, it splits the work so prompting stays cheap: a heavy ViT image encoder runs once per image, then a lightweight prompt encoder and mask decoder turn each prompt into a mask in tens of milliseconds (the paper cites roughly 50 ms in a web browser on CPU). You can re-prompt the same image many times without re-encoding it, and that design decision is what makes click-to-segment feel instant in an annotation tool.
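The encode-once, prompt-many pattern can be illustrated with a toy. The sketch below is not SAM: `ToySegmenter` is a hypothetical stand-in whose "encoder" just copies the pixels and whose "decoder" is a flood fill over similar intensities. What it does show is the split of responsibilities: one expensive call per image, then any number of cheap calls per prompt.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale image as rows of intensities
Mask = List[List[int]]   # binary mask, same shape as the image


class ToySegmenter:
    """Stand-in for an encode-once / prompt-many segmenter (not SAM itself)."""

    def __init__(self) -> None:
        self.encoder_calls = 0
        self._embedding: Optional[Image] = None

    def set_image(self, image: Image) -> None:
        # Expensive step, done once per image. A real model would run a ViT
        # here; this toy just stores a copy of the pixels as the "embedding".
        self.encoder_calls += 1
        self._embedding = [row[:] for row in image]

    def predict(self, point: Tuple[int, int], tolerance: int = 10) -> Mask:
        # Cheap step, done once per prompt: flood-fill from the clicked pixel
        # over 4-connected neighbours whose intensity is close to the seed's.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        emb = self._embedding
        height, width = len(emb), len(emb[0])
        x0, y0 = point
        seed = emb[y0][x0]
        mask = [[0] * width for _ in range(height)]
        stack = [(x0, y0)]
        while stack:
            x, y = stack.pop()
            if not (0 <= x < width and 0 <= y < height):
                continue
            if mask[y][x] or abs(emb[y][x] - seed) > tolerance:
                continue
            mask[y][x] = 1
            stack.extend([(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)])
        return mask


image = [
    [0, 0, 200, 200],
    [0, 0, 200, 200],
    [90, 90, 90, 90],
]
segmenter = ToySegmenter()
segmenter.set_image(image)          # one encode
left = segmenter.predict((0, 0))    # prompt 1
right = segmenter.predict((3, 1))   # prompt 2
bottom = segmenter.predict((1, 2))  # prompt 3
```

In the released `segment_anything` package the same shape appears as `SamPredictor.set_image(...)` followed by repeated `predict(...)` calls, though the exact arguments are best checked against that package's documentation.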
The data engine behind SA-1B
You cannot annotate a billion masks by hand, so the contribution that actually carries the paper is the data engine: a three-stage loop in which the model and the dataset bootstrap each other. In the assisted-manual stage, annotators label masks with an interactive, model-powered tool. In the semi-automatic stage, the model pre-fills the masks it is confident about and annotators add the objects it missed, which increases mask diversity. In the fully automatic stage, the model is prompted with a regular grid of points and its proposals are filtered for quality. The result is SA-1B, 1.1B masks on 11M images, with the overwhelming majority of masks generated automatically. The masks are class-agnostic (no category labels), which pushes the model to learn “what is an object” generally rather than memorize a taxonomy.
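The fully automatic stage can be sketched in a few lines, under simplifying assumptions. The helpers `point_grid`, `iou`, and `filter_masks` are hypothetical, the thresholds are placeholders rather than the paper's values, and masks are represented as sets of pixels. The paper's actual filtering is more involved; this only shows the general shape: prompt on a regular grid, keep confident proposals, drop near-duplicates.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Proposal = Tuple[Set[Pixel], float]  # (mask pixels, confidence score)


def point_grid(width: int, height: int, per_side: int) -> List[Tuple[float, float]]:
    """Evenly spaced (x, y) point prompts, centred within each grid cell."""
    step_x, step_y = width / per_side, height / per_side
    return [
        ((i + 0.5) * step_x, (j + 0.5) * step_y)
        for j in range(per_side)
        for i in range(per_side)
    ]


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_masks(
    proposals: List[Proposal],
    min_score: float = 0.8,    # placeholder threshold, not the paper's value
    max_overlap: float = 0.7,  # placeholder threshold, not the paper's value
) -> List[Proposal]:
    """Keep confident masks, then drop near-duplicates (greedy NMS on mask IoU)."""
    confident = [p for p in proposals if p[1] >= min_score]
    confident.sort(key=lambda p: p[1], reverse=True)
    kept: List[Proposal] = []
    for mask, score in confident:
        if all(iou(mask, other) <= max_overlap for other, _ in kept):
            kept.append((mask, score))
    return kept


grid = point_grid(width=4, height=4, per_side=2)

# Made-up proposals standing in for what a model would return per grid point.
proposals = [
    ({(0, 0), (1, 0), (0, 1), (1, 1)}, 0.95),  # object A
    ({(0, 0), (1, 0), (0, 1)}, 0.90),          # near-duplicate of A (IoU 0.75)
    ({(3, 3)}, 0.85),                          # object B
    ({(2, 2), (3, 2)}, 0.40),                  # low confidence
]
kept = filter_masks(proposals)
```

Here the near-duplicate and the low-confidence proposal are discarded, leaving one mask per object.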
Key results
- SA-1B contains over 1 billion masks on 11 million licensed, privacy-respecting images — far larger than any prior segmentation dataset.
- The image encoder is a heavyweight ViT, but the prompt encoder and mask decoder are light enough to run in real time after the image is encoded once.
- Zero-shot transfer is the central result: without task-specific training, the paper reports performance that is often competitive with or superior to prior fully supervised results, across tasks that include single-point segmentation, edge detection, object proposal generation, and instance segmentation.
- Ambiguity is handled by predicting multiple masks per prompt with confidence scores, not a single forced answer.
The honest judgment: SAM’s lasting impact is the dataset and the promptable formulation, not a single SOTA accuracy number. It is a strong, general mask proposer — a primitive other systems build on.
Limits and open questions
SAM segments; it does not understand. It returns a mask without knowing the object’s category, identity, function, or whether segmenting it matters for safety. The paper and later practical use surface real gaps: fine structures and thin objects, low-contrast or transparent regions, and strong domain shift (medical imaging, satellite, microscopy), where masks degrade and need domain-specific tuning. It is a still-image model with no native video support or temporal consistency, which later work (SAM 2) targets directly. The largest image encoder is also expensive to run, so “real-time” applies to prompting, not to the one-time encode on constrained hardware. If you need labeled semantic segmentation out of the box, SAM alone is the wrong tool: you still need a classifier on top.
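A rough sketch of what "a classifier on top" can look like: take a class-agnostic mask, crop the region it covers, and pass the crop to a separate labeling model. Everything here is hypothetical, including the helpers `mask_to_box` and `label_mask`. `toy_classifier` is a trivial brightness rule standing in for a trained classifier.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # grayscale image as rows of intensities
Mask = List[List[int]]           # binary mask, same shape as the image
Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 (inclusive)


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Tight bounding box around the mask's nonzero pixels, or None if empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def label_mask(
    image: Image, mask: Mask, classify: Callable[[Image], str]
) -> Optional[str]:
    """Crop the masked region (background zeroed) and hand it to a classifier."""
    box = mask_to_box(mask)
    if box is None:
        return None
    x0, y0, x1, y1 = box
    crop = [
        [image[y][x] if mask[y][x] else 0 for x in range(x0, x1 + 1)]
        for y in range(y0, y1 + 1)
    ]
    return classify(crop)


def toy_classifier(crop: Image) -> str:
    # Hypothetical stand-in: a real system would call a trained model here.
    values = [v for row in crop for v in row]
    return "bright" if sum(values) / len(values) > 128 else "dark"


image = [[10, 10, 220], [10, 10, 230], [10, 10, 10]]
mask = [[0, 0, 1], [0, 0, 1], [0, 0, 0]]
label = label_mask(image, mask, toy_classifier)
```

The segmenter and the classifier stay independent in this arrangement, so either can be swapped without retraining the other.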
FAQ
What is Segment Anything (SAM) in one sentence?
SAM is Meta AI’s promptable segmentation model that, given a point, box, or mask prompt, returns valid object masks and transfers zero-shot to images and tasks it was not trained on.
What is the SA-1B dataset?
SA-1B is the dataset released with SAM: over 1 billion masks on 11 million licensed, privacy-respecting images, built with a model-in-the-loop data engine and dominated by automatically generated masks. It was the largest segmentation dataset at the time of its release.
Does SAM know what objects it is segmenting?
No. SAM produces class-agnostic masks — it outlines an object’s pixels without assigning a category, identity, or label. You need a separate model on top if you want to name what was segmented.
Where does Segment Anything still struggle?
SAM weakens on thin or fine structures, transparent and low-contrast regions, and out-of-distribution domains like medical or satellite imagery, and it has no native video or temporal handling.
Is SAM better than supervised segmentation models?
On zero-shot transfer the paper reports SAM is often competitive with or superior to prior fully supervised results, but for a fixed labeled task with abundant in-domain data a specialized supervised model can still win.
SAM’s real bet is not a leaderboard score — it is that a billion-mask data engine plus a promptable interface turns segmentation into a reusable primitive. Read the original: arXiv:2304.02643.