Vision Foundation Models
Large visual representation models that transfer across recognition, localization, and perception tasks.
AI Agents · NVIDIA
SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.
Brain Decoding · Independent Researcher
Brain-Diffuser reconstructs natural scenes from fMRI signals in two stages: a VDVAE decoder produces a coarse low-level image, then Versatile Diffusion refines it, conditioned on CLIP vision and text features predicted from the brain data.
Self-Supervised Learning · Google DeepMind
BYOL learns visual representations without negative pairs: an online network is trained to predict a slowly updated target network's output for a different augmented view of the same image, reaching 74.3% top-1 on ImageNet under linear evaluation with a ResNet-50.
Segmentation · Google Research
DeepLab combines atrous (dilated) convolution, atrous spatial pyramid pooling, and a fully connected CRF for semantic image segmentation, reaching 79.7% mIoU on the PASCAL VOC 2012 test set.
Text-to-Image · Independent Researcher
DIRECT: 3D-Aware Object Insertion with Visual Proxies turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Diffusion Models · The Hong Kong Polytechnic University
GGT-100K: Generative Ground Truth for Image Restoration turns real-world image restoration data into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Self-Supervised Learning · Meta AI
MAE masks a random 75% of image patches, encodes only the visible ones, and trains a lightweight decoder to reconstruct the missing pixels; a ViT-Huge pretrained this way reaches 87.8% top-1 using only ImageNet-1K data.
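As a rough illustration of the masking step in masked image modeling, the sketch below picks which patches stay visible to the encoder and which are hidden for reconstruction. The 75% default follows the ratio reported in the MAE paper; the function name `random_patch_mask`, the seed handling, and the 196-patch example are assumptions made here for illustration, not the authors' code.

```python
import random
from typing import List, Tuple


def random_patch_mask(
    num_patches: int, mask_ratio: float = 0.75, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) sets by uniform random sampling.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)           # local RNG so the result is reproducible
    order = list(range(num_patches))
    rng.shuffle(order)                  # random permutation of patch indices
    num_visible = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(order[:num_visible])   # only these would reach the encoder
    masked = sorted(order[num_visible:])    # these would be reconstructed
    return visible, masked


# A 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196)
```

With the default ratio, the encoder would see 49 of the 196 patches, which is where most of the pretraining compute saving comes from.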
Segmentation · Meta AI
Mask R-CNN extends Faster R-CNN with a parallel branch that predicts a mask for each detected object and replaces RoIPool with RoIAlign to keep pixel-level alignment, running at about 5 fps.
Multimodal Models · Shanghai AI Laboratory
OVO-S-Bench: Streaming Spatial Intelligence in MLLMs turns streaming spatial intelligence into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Self-Supervised Learning · Google Research
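SimCLR learns visual representations by pulling two augmented views of the same image together and pushing all other images in the batch apart with a contrastive loss; a linear classifier on its features reaches 76.5% top-1 on ImageNet, matching a supervised ResNet-50.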
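To make the contrastive objective concrete, here is a minimal pure-Python sketch of a normalized temperature-scaled cross-entropy (NT-Xent) loss of the kind SimCLR uses. The pairing convention (indices 2k and 2k+1 are two views of the same image), the temperature of 0.5, and the toy 2-D embeddings are assumptions for illustration; a real implementation would be batched on an accelerator.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def nt_xent(embeddings: List[Vector], temperature: float = 0.5) -> float:
    """Mean NT-Xent loss; views 2k and 2k+1 are assumed to be a positive pair."""
    n = len(embeddings)
    if n < 2 or n % 2:
        raise ValueError("need an even number of embeddings (two views per image)")
    z = [_normalize(v) for v in embeddings]
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the other view of the same image
        # Cosine similarities to every other embedding, scaled by temperature.
        sims = [_dot(z[i], z[k]) / temperature for k in range(n) if k != i]
        top = max(sims)
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims))
        total += log_denom - _dot(z[i], z[j]) / temperature
    return total / n


aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # views agree
shuffled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # views disagree
loss_aligned = nt_xent(aligned)
loss_shuffled = nt_xent(shuffled)
```

The loss is lower when the two views of each image map to nearby embeddings, which is the signal the encoder is trained on.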
Robotics · Independent Researcher
TVRBench: Can Models Move to a Target Viewpoint? turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Segmentation · University of Freiburg
U-Net pairs a contracting encoder with a symmetric expanding decoder joined by skip connections, so it can be trained end-to-end on very few annotated biomedical images; it won the ISBI 2015 cell tracking challenge.
Segmentation · Meta AI
Mask2Former uses masked attention to unify semantic, instance, and panoptic segmentation, reaching 57.8 PQ on COCO panoptic and 57.7 mIoU on ADE20K.
Multimodal Models · Meta AI
VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.
Brain Decoding · MIT
BrainCause uses text-to-image generation plus an fMRI encoder to causally test what brain regions represent, cutting false-positive localizations from 73.4% to 23% across 260 visual concepts.
Self-Supervised Learning · Meta AI
DINOv2 pretrains Vision Transformers with no labels on a curated 142M-image set, then freezes the backbone — a linear probe on top matches or beats OpenCLIP on most image- and pixel-level benchmarks.
Multimodal Models · Google DeepMind
Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.
Multimodal Models · NVIDIA
LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.
Vision Foundation Models · Microsoft Research
ResNet adds skip connections so a layer learns a residual instead of a full mapping, making 152-layer networks trainable. An ensemble hit 3.57% top-5 error on ImageNet and won ILSVRC 2015.
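The skip connection can be sketched in a few lines: the block outputs x + F(x), so if the learned branch F contributes nothing, the block reduces to the identity and depth does no harm. The names `residual_block`, `zero_branch`, and `run_stack` are hypothetical, and the branch here is a placeholder function, not a convolutional layer.

```python
from typing import Callable, List

Vector = List[float]
Branch = Callable[[Vector], Vector]


def residual_block(x: Vector, branch: Branch) -> Vector:
    """Return x + F(x), where F is the learned residual branch."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("residual branch must preserve the input shape")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x: Vector) -> Vector:
    """Placeholder branch standing in for a layer that has learned nothing."""
    return [0.0 for _ in x]


def run_stack(x: Vector, branches: List[Branch]) -> Vector:
    """Apply a sequence of residual blocks in order."""
    for branch in branches:
        x = residual_block(x, branch)
    return x


# 152 blocks whose branches output zero pass the input through unchanged.
signal = [1.0, -2.0, 3.0]
output = run_stack(signal, [zero_branch] * 152)
```

A plain stack of layers without the `x +` term would have to learn the identity mapping explicitly at every layer, which is the optimization difficulty the residual formulation sidesteps.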
Segmentation · Meta AI
Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.
Vision Foundation Models · Google Research
ViT shows a plain Transformer fed raw 16x16 image patches beats top CNNs once pre-trained on JFT-300M, reaching 88.55% on ImageNet while using far less training compute.
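A minimal sketch of the patch step, assuming an image stored as nested lists of shape height x width x channels: each non-overlapping 16x16 patch is flattened into one vector, which a ViT would then linearly project and combine with a position embedding (those later steps are not shown). The function name `patchify` is an illustrative choice.

```python
from typing import List

Image = List[List[List[float]]]  # image[row][col] -> list of channel values


def patchify(image: Image, patch: int = 16) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):          # row-major order over tiles
        for left in range(0, width, patch):
            flat: List[float] = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])     # append all channels of a pixel
            patches.append(flat)
    return patches


# A 224x224 RGB image becomes 196 tokens of 16 * 16 * 3 = 768 values each.
blank = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(blank)
```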
Multimodal Models · OpenAI
CLIP trains paired image and text encoders on 400 million internet image-text pairs, then matches the original ResNet-50's ImageNet accuracy zero-shot — without using any of its 1.28M labeled examples.
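Zero-shot classification with paired encoders comes down to comparing one image embedding against the text embeddings of candidate label prompts. The sketch below uses hand-written 3-D toy vectors in place of real encoder outputs; the prompts, the vectors, the function name `zero_shot_classify`, and the `logit_scale` of 100 are all assumptions for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def zero_shot_classify(
    image_emb: Vector, text_embs: Dict[str, Vector], logit_scale: float = 100.0
) -> Dict[str, float]:
    """Softmax over scaled cosine similarities between an image and label prompts."""
    img = _normalize(image_emb)
    logits = {
        prompt: logit_scale * sum(a * b for a, b in zip(img, _normalize(emb)))
        for prompt, emb in text_embs.items()
    }
    top = max(logits.values())                      # subtract max for stability
    exps = {p: math.exp(v - top) for p, v in logits.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


# Toy embeddings standing in for encoder outputs (not real CLIP features).
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
probs = zero_shot_classify([0.9, 0.1, 0.0], prompts)
best = max(probs, key=probs.get)
```

No labeled training images are involved at prediction time: adding a class only requires writing another prompt and embedding it.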
Segmentation · Meta AI
SAM 2 carries one click through a whole video using a streaming memory module, hitting better masks with 3x fewer interactions than prior video methods and running 6x faster than SAM on images.