Phi-3-mini: A 3.8B Model That Rivals GPT-3.5 on Your Phone

Quick answer

Phi-3-mini is a 3.8-billion-parameter language model from Microsoft that scores 69% on MMLU and 8.38 on MT-bench, which puts it in the same range as Mixtral 8x7B and GPT-3.5, models many times its size. Quantized to 4 bits, it shrinks to about 1.8GB and runs offline on an iPhone 14 at more than 12 tokens per second. The paper’s real claim is not the architecture, which is a plain Llama-2-style transformer, but the data: phi-3-mini was trained on 3.3 trillion tokens of heavily filtered web text plus synthetic “textbook-quality” data, and the authors argue that what you train on matters more than how big you make the model.

The “textbook-quality data” thesis

The core bet of the entire phi line is that the standard scaling laws describe what happens when the data source is held fixed, so the model sizes they call for are partly an artifact of training on raw, undifferentiated internet text. Phi-3 keeps the dataset recipe from phi-2 and scales it up: it filters public web data hard for educational value and reasoning density, then mixes in LLM-generated synthetic data designed to teach specific skills (logic, math, common-sense reasoning, niche knowledge). The team frames this as the “data optimal regime” for a given model size, in contrast to the “compute optimal regime” of Chinchilla-style scaling. Instead of asking how many tokens a 3.8B model can absorb, they ask which 3.3T tokens are worth its limited capacity.
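
To put rough numbers on that contrast, the sketch below compares phi-3-mini’s token budget with the often-quoted Chinchilla rule of thumb of about 20 training tokens per parameter. That figure is an assumption borrowed from the scaling-law literature, not a number from the phi-3 paper, and the function names are illustrative.

```python
# Assumption: ~20 training tokens per parameter, the common Chinchilla rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def chinchilla_budget(n_params: float) -> float:
    """Rule-of-thumb compute-optimal token count for a model of this size."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


# Figures from the article: 3.8B parameters, 3.3T training tokens.
PHI3_MINI_PARAMS = 3.8e9
PHI3_MINI_TOKENS = 3.3e12

ratio = tokens_per_param(PHI3_MINI_PARAMS, PHI3_MINI_TOKENS)
budget = chinchilla_budget(PHI3_MINI_PARAMS)

print(f"tokens per parameter: {ratio:.0f}")
print(f"rule-of-thumb budget: {budget / 1e9:.0f}B tokens")
print(f"actual / rule-of-thumb: {PHI3_MINI_TOKENS / budget:.0f}x")
```

Under these assumptions phi-3-mini sees roughly 868 tokens per parameter, about 43 times the rule-of-thumb budget of 76B tokens. Training that far past the compute-optimal point only makes sense if the extra tokens are worth the model’s capacity, which is why the argument rests on data quality.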

This is the load-bearing idea, and it is also the least verifiable part of the paper. The exact filters, the synthetic-data generation prompts, and the mixture weights are not disclosed — understandably, since they are the moat. So the thesis is supported by benchmark outcomes rather than by a reproducible recipe.

Key results

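MMLU: phi-3-mini reaches 69%, versus 71.4% for GPT-3.5 and 70.5% for Mixtral 8x7B in the paper’s comparison table. That is within about three points of both, at less than a tenth of the total parameter count of Mixtral (a sparse model with roughly 47B parameters, about 13B of them active per token).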
MT-bench: 8.38, again in GPT-3.5 territory, indicating the alignment and chat tuning held up, not just raw knowledge.
On-device: quantized to 4 bits, phi-3-mini occupies ~1.8GB and generates over 12 tokens/sec fully offline on an A16-chip iPhone 14 — the literal “on your phone” claim.
Scaling within the family: phi-3-small (7B) and phi-3-medium (14B), trained on 4.8T tokens, hit 75% and 78% on MMLU and 8.7 and 8.9 on MT-bench — the data recipe keeps paying off as size grows.

Why a 3.8B model matters now

The interesting consequence is not that phi-3 beats anyone (it doesn’t beat GPT-4) but that it lowers the floor for “useful.” A model that fits in phone RAM and runs without a network connection changes the deployment economics: no per-token API cost, no latency to a datacenter, no data leaving the device. For a large class of tasks, such as summarization, classification, structured extraction, and on-device assistants, a 3.8B model at roughly GPT-3.5 benchmark quality is plausibly enough, and those high-volume workloads are where avoiding per-token costs saves the most.

Limits and open questions

Phi-3-mini’s weakness is the flip side of its data strategy: capacity. The paper is candid that a 3.8B-parameter model does not have room to store much factual knowledge, so it underperforms on trivia-heavy benchmarks like TriviaQA and is weak at tasks that need wide world knowledge. The authors suggest pairing it with a search engine rather than expecting it to recall facts. It is also primarily English; multilingual coverage is thin (the later phi-3.5 series addresses this). And the headline numbers invite fair skepticism: benchmark contamination is a standing risk for any model whose training data is curated for “quality,” and because the data pipeline is undisclosed, no one outside Microsoft can audit whether benchmark-adjacent content leaked in. The strong scores are real, but “matches GPT-3.5” should be read as “on these benchmarks,” not as a general claim about capability.
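
The search-engine pairing mentioned above can be pictured as a retrieve-then-prompt loop: look up relevant passages first, then hand them to the model so it reasons over supplied facts instead of recalling them. The sketch below is a toy illustration only. The keyword-overlap retriever stands in for a real search engine, the prompt wording is invented, and nothing here reflects how the paper’s authors wired up search.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real search indexing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents sharing at least one word with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return [d for d in ranked[:k] if q_words & tokenize(d)]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that asks the model to answer from the passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical mini corpus standing in for search results.
DOCS = [
    "Paris is the capital of France.",
    "Mount Everest is the tallest mountain on Earth.",
    "Bananas contain potassium.",
]

question = "What is the capital of France?"
prompt = build_prompt(question, retrieve(question, DOCS, k=1))
print(prompt)  # this string would be passed to the on-device model
```

The design point is the division of labor: the retriever supplies the facts and the small model supplies the reading and reasoning, which plays to the strengths the benchmarks suggest it has.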

FAQ

How does phi-3-mini run on a phone?

Phi-3-mini is 3.8B parameters, and quantized to 4 bits it occupies about 1.8GB of memory. The paper reports it running natively on an iPhone 14 (A16 chip) fully offline at more than 12 tokens per second, with no server connection.
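
The 1.8GB figure is close to what back-of-envelope arithmetic predicts. The sketch below estimates weight storage only. It ignores quantization metadata, the KV cache, and activations, and the 200-token reply length is a made-up example rather than anything from the paper.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_weight / 8


def seconds_to_generate(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to emit n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_sec


N_PARAMS = 3.8e9  # phi-3-mini parameter count from the article

for bits in (16, 8, 4):
    size = weight_bytes(N_PARAMS, bits)
    print(f"{bits:>2}-bit: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")

# Hypothetical 200-token reply at the reported floor of 12 tokens/sec.
print(f"200 tokens: {seconds_to_generate(200, 12.0):.1f} s")
```

At 4 bits the weights come to 1.9 billion bytes, or about 1.77 GiB, which lines up with the reported figure of roughly 1.8GB. At 16 bits the same weights would need 7.6 GB, more than a phone can comfortably spare, so quantization is what makes the on-device claim possible.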

Is phi-3-mini really as good as GPT-3.5?

On academic benchmarks it is close: 69% on MMLU and 8.38 on MT-bench, comparable to GPT-3.5 and Mixtral 8x7B. But it has far less factual knowledge than those models because of its small size, so “as good” holds for reasoning and chat quality, not for broad recall.

What is the “textbook-quality data” idea in Phi-3?

It is the claim that aggressively filtering web text for educational and reasoning value, plus adding synthetic data that teaches specific skills, lets a small model punch far above its parameter count. Phi-3 calls this training in the “data optimal regime” rather than the compute-optimal regime of standard scaling laws.

What are phi-3-small and phi-3-medium?

They are the 7B and 14B members of the family, trained on 4.8T tokens, reaching 75% and 78% on MMLU and 8.7 and 8.9 on MT-bench. They show the same curated-data recipe continues to scale beyond the 3.8B mini model.

One line: pick the 3.3 trillion tokens carefully and a 3.8B model can sit in your pocket and answer like GPT-3.5 — just don’t ask it to remember everything. Read the original paper on arXiv.