Llama 3: A 405B Dense Open Model Comparable to GPT-4
Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K-token context window, pretrained on more than 15 trillion tokens and released with open weights.
Quick answer
Llama 3 is a family of language models whose flagship is a dense 405B-parameter Transformer with a 128K-token context window, trained on over 15 trillion tokens and released openly with weights. Meta reports it reaches quality comparable to GPT-4 on a wide range of tasks, alongside smaller 8B and 70B models for everyday deployment. The headline is not a new architecture trick — it is that a frontier-class model, plus its training and evaluation report, is now public.
A 405B dense model, fully open
The most consequential design choice is what Llama 3 isn’t: it is not a mixture-of-experts model. The 405B flagship is a single dense Transformer, so every parameter is active on every token. That makes it heavier to serve than a sparse model of equivalent nominal size, but simpler to reason about, fine-tune, and reproduce — which matters when the goal is an open reference that other labs can build on.
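To make the serving cost of a dense model concrete, here is a back-of-the-envelope sketch of weight memory alone. It assumes nominal parameter counts of 8B, 70B, and 405B and three illustrative storage precisions (bf16, fp8, int4) that are not taken from the source. It ignores activations, the KV cache, and runtime overhead, so real requirements are higher.

```python
# Weight memory for a dense model: every parameter must be resident,
# so bytes = parameter_count * bytes_per_parameter.

# Nominal model sizes named in the article.
PARAMS = {"8B": 8e9, "70B": 70e9, "405B": 405e9}

# Assumed storage precisions (illustrative, not from the source).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def weight_table() -> dict:
    """Map (model, precision) to weight storage in GB."""
    return {
        (model, prec): weight_gb(n, b)
        for model, n in PARAMS.items()
        for prec, b in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for (model, prec), gb in weight_table().items():
        print(f"{model:>5} {prec:>5}: {gb:8.1f} GB")
```

Under these assumptions the 405B weights alone come to about 810 GB at 2 bytes per parameter, which is why the flagship needs multi-accelerator serving while the 8B fits on a single GPU.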
The release is a herd, not a single checkpoint. The 8B and 70B models cover the cases where most people actually deploy: laptops, single GPUs, latency-sensitive products. The 405B is the prestige model that proves open weights can sit at the frontier. Meta published both pretrained and post-trained (instruction-tuned) versions of the 405B, plus Llama Guard 3 for input/output safety filtering. The models natively support multilinguality, coding, reasoning, and tool use rather than English chat alone.
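Input/output safety filtering of the kind Llama Guard 3 is meant for can be pictured as a wrapper around generation. The sketch below is a generic pattern, not Llama Guard's actual interface: `guarded_generate`, `toy_model`, and `toy_classifier` are hypothetical names, and the keyword check is a stand-in for a real safety classifier.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Check the prompt, generate a reply, then check the reply."""
    if is_unsafe(prompt):   # input filtering
        return refusal
    reply = generate(prompt)
    if is_unsafe(reply):    # output filtering
        return refusal
    return reply


# Toy stand-ins so the sketch runs without any model (hypothetical).
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_classifier(text: str) -> bool:
    return "forbidden" in text.lower()


if __name__ == "__main__":
    print(guarded_generate("hello", toy_model, toy_classifier))
    print(guarded_generate("a FORBIDDEN request", toy_model, toy_classifier))
```

Keeping the classifier separate from the generator means either one can be swapped without retraining the other, which is one plausible reason to ship a dedicated safety model alongside the language models.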
Data and scale over architecture tricks
Llama 3’s bet is that careful scaling beats clever architecture. The pretraining corpus exceeds 15 trillion tokens, a large jump over Llama 2’s roughly 2 trillion, and the model is a fairly conventional dense Transformer. The interesting engineering is in the recipe: the data mixture and filtering, the scale of pretraining, then post-training with supervised fine-tuning and preference optimization, layered with safety tuning and tool-use behavior. The paper documents this stack in unusual detail, which is the real gift to the field: most frontier reports disclose far less.
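For a rough sense of what that scale implies, training compute for a dense Transformer is often approximated as 6 × parameters × tokens. The sketch below applies that rule of thumb to the figures quoted above. The 6·N·D approximation is a common heuristic and not something the article states, and 15 trillion is used as a lower bound on the token count, so treat the result as an order-of-magnitude estimate.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


N_PARAMS = 405e9   # flagship parameter count from the article
N_TOKENS = 15e12   # lower bound: the article says "over 15 trillion"

if __name__ == "__main__":
    flops = training_flops(N_PARAMS, N_TOKENS)
    ratio = tokens_per_param(N_PARAMS, N_TOKENS)
    print(f"approx. training compute: {flops:.2e} FLOPs")
    print(f"tokens per parameter:     {ratio:.1f}")
```

With these inputs the estimate lands in the 10^25 FLOPs range, at roughly 37 training tokens per parameter for the flagship.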
The 128K context window is a practical lever, not a benchmark stunt: it lets the model take long documents, codebases, or multi-step tool transcripts in a single pass. Combined with native tool use, that is what makes the herd usable as an agent backbone rather than a chat toy.
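Long context has a memory price at inference time, because the keys and values for every past token are cached. The sketch below estimates that cache for one sequence. The layer count, number of key/value heads, and head size are assumptions chosen for illustration and should be checked against the paper's architecture table. Treating "128K" as 131,072 tokens and storing values in 2 bytes are also assumptions.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# Assumed flagship-like configuration (illustrative, not from the article).
ASSUMED = dict(n_layers=126, n_kv_heads=8, head_dim=128, bytes_per_value=2)
CONTEXT_TOKENS = 128 * 1024  # "128K" read as 131,072 tokens

if __name__ == "__main__":
    total = kv_cache_bytes(CONTEXT_TOKENS, **ASSUMED)
    print(f"KV cache for one full-length sequence: {total / 2**30:.1f} GiB")
```

With these assumed values a single full-length sequence needs about 63 GiB of cache on top of the weights, so long-context serving is constrained by memory as much as by compute.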
Compositional multimodality, still in the lab
The paper also reports experiments adding image, video, and speech to Llama 3 through a compositional approach — bolting modality encoders onto the language model rather than training one native multimodal model from scratch. Meta says these versions perform competitively with the state of the art on image, video, and speech-recognition tasks. The honest caveat is in the paper itself: those multimodal models were still under development and not broadly released. So when people say “Llama 3 is multimodal,” they are describing a research result, not the weights most users actually got.
Key results
Llama 3 405B delivers quality comparable to leading closed models such as GPT-4 across a broad set of tasks, per Meta’s own evaluation. The flagship is a dense 405B Transformer (not MoE) with a 128K context window, pretrained on more than 15 trillion tokens. The open release covers pretrained and post-trained 405B weights plus the 8B and 70B models, and ships Llama Guard 3 as a dedicated safety classifier. The compositional image/video/speech variants performed competitively with state-of-the-art systems on their respective tasks but were not broadly released because they were still under development. The standout contribution is transparency: the report exposes far more of the training and evaluation pipeline than a typical product launch.
Limits and open questions
Open weights are not open training. You get the parameters, but the exact data mixture, full filtering rules, compute budget, and safety process still require trusting Meta’s report, since none of it is independently reproducible at this scale. Serving the 405B dense model is genuinely expensive: because every parameter fires on every token, most real-world value flows through distillation, quantization, or the 70B and 8B models, not the flagship itself. The “GPT-4-comparable” claim rests largely on Meta’s own evaluation harness, so it should be read as a strong signal, not a neutral verdict. And the multimodal story is the softest part of the paper: competitive numbers on a model the public could not download are a promise, not a product.
Who should skip it: if you need a deployable assistant today, start with the 70B or 8B and treat the 405B as a teacher to distill from, not something to run in production.
FAQ
Is Llama 3 405B a mixture-of-experts model?
No. The Llama 3 flagship is a dense 405B-parameter Transformer, meaning all parameters are active on every token. Meta deliberately avoided a sparse MoE design, which is why the model is comparatively expensive to serve but straightforward to fine-tune and study.
How much data was Llama 3 trained on?
Llama 3 was pretrained on more than 15 trillion tokens, a large increase over Llama 2’s roughly 2 trillion. The model also supports a context window of up to 128K tokens at inference time.
Does Llama 3 really match GPT-4?
On a wide range of tasks, Meta reports Llama 3 405B delivers quality comparable to leading closed models such as GPT-4. That result comes from Meta’s own evaluations, so treat it as a strong but not fully independent comparison.
Is Llama 3 multimodal?
Partly. The paper describes compositional image, video, and speech versions that performed competitively with the state of the art, but those models were still under development and not broadly released — the widely available Llama 3 weights are the text models.
Llama 3’s real release was not the 405B model — it was the manual for building one. Read the source: arXiv:2407.21783.