Self-Supervised Learning · Meta AI
MAE: Masked Autoencoders as Scalable Vision Learners
MAE pretrains Vision Transformers by masking a large share of image patches (75% in the paper) and reconstructing the missing pixels, with an asymmetric design in which the encoder sees only the visible patches and a lightweight decoder rebuilds the image.
Institution
Meta's AI research organization, known for open models, computer vision systems, and large-scale infrastructure.
Mask R-CNN extends Faster R-CNN with a parallel branch that predicts a segmentation mask for each detected object, giving a simple, general framework for instance segmentation.
Brain2Qwerty decodes typed sentences from non-invasive brain recordings: MEG reaches 32% CER on average, EEG trails at 67%, and the best participants reach 19%.
Mask2Former uses masked attention to unify semantic, instance, and panoptic segmentation, reaching 57.8 PQ on COCO panoptic and 57.7 mIoU on ADE20K.
Small Language Models · Meta AI
MobileLLM argues architecture matters more at sub-billion scale: deep-thin designs plus sharing improve 125M/350M models by 2.7%/4.3%, then 0.7%/0.8% more.
VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.
Code Llama continues training Llama 2 on code, reaching up to 67% on HumanEval and 65% on MBPP, the best open scores at its release, with infilling, instruction following, and 100k-token context support.
Self-Supervised Learning · Meta AI
DINOv2 pretrains Vision Transformers with no labels on a curated 142M-image set, then freezes the backbone — a linear probe on top matches or beats OpenCLIP on most image- and pixel-level benchmarks.
Llama 2 shipped 7B, 13B, and 70B open-weight models plus Llama 2-Chat, the first open chat model whose RLHF pipeline — including a separate safety reward model and Ghost Attention — was documented in full.
Retrieval-Augmented Generation · Meta AI
The original RAG paper bolts a Wikipedia dense retriever (DPR) onto a BART seq2seq generator, sets a new state of the art on three open-domain QA tasks, and updates knowledge by swapping the index, with no retraining.
Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.
Toolformer trains a model to decide which API to call — calculator, QA, search, translation, calendar — purely by keeping the sampled calls that lower next-token loss, with only a handful of demos per tool.
Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K context window, trained on 15T+ tokens and openly published with weights.
SAM 2 carries one click through a whole video using a streaming memory module, producing more accurate masks with 3x fewer interactions than prior video methods and running 6x faster than SAM on images.