Vision Foundation Models

Large visual representation models that transfer across recognition, localization, and perception tasks.

Vision foundation models map images and video to reusable representations, replacing one-off task models. The core shift is from training a classifier or detector for a narrow label set to training a visual backbone that can transfer across recognition, segmentation, dense prediction, retrieval, and multimodal reasoning.
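
As a rough illustration of what "reusable" means here, the sketch below feeds one fixed feature extractor into two different task heads: retrieval and classification. Everything in it is an assumption made for illustration: `frozen_backbone` is a hypothetical stand-in that only normalizes its input (a real backbone would be a large pretrained network), and the toy three-value "images" and the labels are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def frozen_backbone(pixels: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a pretrained encoder.

    It only L2-normalizes its input; a real backbone would be a large
    network whose weights stay fixed after pretraining.
    """
    norm = math.sqrt(sum(v * v for v in pixels)) or 1.0
    return [v / norm for v in pixels]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (norm_a * norm_b)


def retrieve(query: Vector, gallery: Dict[str, Vector]) -> str:
    """Retrieval head: name of the gallery item closest to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


def fit_prototypes(labeled: List[Tuple[str, Vector]]) -> Dict[str, Vector]:
    """Recognition head: one mean feature vector per class label."""
    grouped: Dict[str, List[Vector]] = {}
    for label, feature in labeled:
        grouped.setdefault(label, []).append(feature)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query: Vector, prototypes: Dict[str, Vector]) -> str:
    """Assign the label whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy "images" as 3-value vectors; the same features feed both heads.
images = {"a": [9.0, 1.0, 0.0], "b": [8.0, 2.0, 1.0], "c": [0.0, 1.0, 9.0]}
features = {name: frozen_backbone(px) for name, px in images.items()}
query = frozen_backbone([7.0, 1.0, 1.0])

nearest = retrieve(query, features)
prototypes = fit_prototypes(
    [("reddish", features["a"]), ("reddish", features["b"]), ("bluish", features["c"])]
)
label = classify(query, prototypes)
print(nearest, label)
```

Only the small heads differ between the two tasks; the feature extractor is computed once and never updated, which is the pattern the frozen-backbone papers in this topic rely on.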

The papers in this topic show three complementary routes. ViT imports the Transformer token interface into images. DINOv2 emphasizes self-supervised features and curated data. Segment Anything reframes segmentation as a promptable primitive, and SAM 2 extends that interaction pattern into video. Together they explain why visual AI is moving from benchmark-specific models toward general perception infrastructure.
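
The "Transformer token interface" for images can be sketched as a patch-splitting step: cut the image into non-overlapping tiles and flatten each tile into one vector. The code below is a minimal illustration under stated assumptions, not ViT's actual implementation. The 224 x 224 input size and the `patchify` helper are hypothetical choices, and the learned linear projection, class token, and position embeddings that a full model adds afterward are omitted.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def patchify(image: Image, patch: int = 16) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping patch x patch tiles,
    flattening each tile into one token vector (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# A 224 x 224 RGB image of zeros stands in for real pixel data (assumed size).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch=16)
print(len(tokens), len(tokens[0]))  # 14 * 14 = 196 tokens, each 16 * 16 * 3 = 768 values
```

Once the image is a sequence of vectors, it can be fed to a standard Transformer encoder in the same way a sentence of word embeddings would be.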

Start here

Self-Supervised Learning · Meta AI

DINOv2: Self-Supervised Visual Features That Skip Finetuning

DINOv2 pretrains Vision Transformers with no labels on a curated 142M-image set, then freezes the backbone — a linear probe on top matches or beats OpenCLIP on most image- and pixel-level benchmarks.

Segmentation · Meta AI

Segment Anything (SAM): One Promptable Model, a Billion Masks

Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.

Vision Foundation Models · Google Research

Vision Transformer (ViT): An Image is Worth 16x16 Words

ViT shows that a plain Transformer fed flattened image patches as tokens matches or beats top CNNs once pre-trained on JFT-300M, reaching 88.55% top-1 accuracy on ImageNet with substantially less pre-training compute.

Foundational papers

Segmentation · University of Freiburg

U-Net: The Biomedical Image Segmentation Baseline

U-Net pairs a contracting encoder with a symmetric expanding decoder and copies feature maps across via skip connections, which lets it segment biomedical images accurately from few annotated examples.

Vision Foundation Models · Microsoft Research

ResNet Explained: Deep Residual Learning for Image Recognition

ResNet adds skip connections so a layer learns a residual instead of a full mapping, making 152-layer networks trainable. An ensemble hit 3.57% top-5 error on ImageNet and won ILSVRC 2015.

Segmentation · Google Research

DeepLab: Atrous Convolution for Semantic Segmentation

DeepLab uses atrous (dilated) convolution to enlarge the receptive field without losing resolution, adds atrous spatial pyramid pooling for multi-scale context, and refines boundaries with a fully connected CRF.

Segmentation · Meta AI

Mask R-CNN: Instance Segmentation on Top of Faster R-CNN

Mask R-CNN adds a mask-prediction branch to Faster R-CNN, running in parallel with box classification and regression, and replaces RoIPool with RoIAlign to preserve pixel-level alignment.

Recent papers

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

Brain Decoding · Independent Researcher

Brain-Diffuser: Natural Scene Reconstruction from fMRI

Brain-Diffuser reconstructs natural scenes from fMRI in two stages: a VDVAE produces a coarse initial image, then Versatile Diffusion refines it using CLIP features predicted from brain activity.

Self-Supervised Learning · Google DeepMind

BYOL: Self-Supervised Learning without Negative Pairs

BYOL trains an online network to predict a target network's representation of a differently augmented view of the same image, with the target updated as a moving average of the online weights, so it needs no negative pairs.

Segmentation · Google Research

DeepLab: Atrous Convolution for Semantic Segmentation

DeepLab uses atrous (dilated) convolution to enlarge the receptive field without losing resolution, adds atrous spatial pyramid pooling for multi-scale context, and refines boundaries with a fully connected CRF.

Text-to-Image · Independent Researcher

DIRECT: 3D-Aware Object Insertion with Visual Proxies

DIRECT turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Diffusion Models · The Hong Kong Polytechnic University

GGT-100K: Generative Ground Truth for Image Restoration

GGT-100K turns real-world image restoration data into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Self-Supervised Learning · Meta AI

MAE: Masked Autoencoders as Scalable Vision Learners

MAE masks a large fraction of image patches (75% in the paper's default setup), encodes only the visible ones, and trains a lightweight decoder to reconstruct the missing pixels.

Segmentation · Meta AI

Mask R-CNN: Instance Segmentation on Top of Faster R-CNN

Mask R-CNN adds a mask-prediction branch to Faster R-CNN, running in parallel with box classification and regression, and replaces RoIPool with RoIAlign to preserve pixel-level alignment.

Multimodal Models · Shanghai AI Laboratory

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs

OVO-S-Bench turns streaming spatial intelligence into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Self-Supervised Learning · Google Research

SimCLR: Contrastive Learning for Visual Representations

SimCLR learns representations by pulling together two augmented views of the same image and pushing apart views of other images in the batch, using a contrastive loss computed on a small projection head.

Robotics · Independent Researcher

TVRBench: Can Models Move to a Target Viewpoint?

TVRBench turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Segmentation · University of Freiburg

U-Net: The Biomedical Image Segmentation Baseline

U-Net pairs a contracting encoder with a symmetric expanding decoder and copies feature maps across via skip connections, which lets it segment biomedical images accurately from few annotated examples.

Segmentation · Meta AI

Mask2Former: One Transformer for Segmentation Tasks

Mask2Former uses masked attention to unify semantic, instance, and panoptic segmentation, reaching 57.8 PQ on COCO panoptic and 57.7 mIoU on ADE20K.

Multimodal Models · Meta AI

VLM3: Vision Language Models Are Native 3D Learners

VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.

Brain Decoding · MIT

BrainCause: Finding Causal Visual Representations in the Brain

BrainCause uses text-to-image generation plus an fMRI encoder to causally test what brain regions represent, cutting false-positive localizations from 73.4% to 23% across 260 visual concepts.

Self-Supervised Learning · Meta AI

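DINOv2: Self-Supervised Visual Features That Skip Finetuning

DINOv2 pretrains Vision Transformers with no labels on a curated 142M-image set, then freezes the backbone; a linear probe on top matches or beats OpenCLIP on most image- and pixel-level benchmarks.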

Multimodal Models · Google DeepMind

Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.

Multimodal Models · NVIDIA

LocateAnything: Parallel Box Decoding for Faster Vision Grounding

LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.

Vision Foundation Models · Microsoft Research

ResNet Explained: Deep Residual Learning for Image Recognition

ResNet adds skip connections so a layer learns a residual instead of a full mapping, making 152-layer networks trainable. An ensemble hit 3.57% top-5 error on ImageNet and won ILSVRC 2015.

Segmentation · Meta AI

Segment Anything (SAM): One Promptable Model, a Billion Masks

Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.

Vision Foundation Models · Google Research

Vision Transformer (ViT): An Image is Worth 16x16 Words

ViT shows that a plain Transformer fed flattened image patches as tokens matches or beats top CNNs once pre-trained on JFT-300M, reaching 88.55% top-1 accuracy on ImageNet with substantially less pre-training compute.

Multimodal Models · OpenAI

CLIP: Learning Visual Models From Natural Language Supervision

CLIP trains paired image and text encoders on 400 million internet image-text pairs, then matches the original ResNet-50's ImageNet accuracy zero-shot — without using any of its 1.28M labeled examples.

Segmentation · Meta AI

SAM 2 Explained: Promptable Segmentation Across Video

SAM 2 carries one click through a whole video using a streaming memory module, producing more accurate masks with 3x fewer interactions than prior video methods and running 6x faster than SAM on images.
