Multimodal Models · Kuaishou Technology
Kwai Keye-VL-2.0: Open Long-Video Multimodal Agent Model
Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.
Topics
Open-weight model releases and the training recipes behind them.
Small Language Models · Independent Researcher
TinyLlama is an open 1.1B-parameter model pretrained on about 1 trillion tokens for roughly three epochs, which makes small language model training a concrete research object with clear method tradeoffs and limits for practical use.
Small Language Models · Hugging Face
SmolLM2 is a 1.7B model overtrained on ~11T tokens through four data stages. It scores 68.7 on HellaSwag and 19.4 on MMLU-Pro, beating Llama3.2-1B — and ships every dataset, not just the weights.
LLM Reasoning · Shanghai AI Laboratory
SU-01, a 30B-A3B open model from Shanghai AI Lab, hits 35 points on IMO 2025 and clears gold lines at IPhO 2024/2025 using only ~338K short SFT trajectories plus a 200-step two-stage RL pipeline.
Code Llama continues training Llama 2 on code, reaching up to 67% on HumanEval and 65% on MBPP, the best open scores at its release, with infilling, instruction following, and 100k-token context support.
DeepSeek-V3 is a 671B-parameter MoE model that activates only 37B parameters per token and matches leading closed models on many benchmarks. It was pre-trained on 14.8T tokens, its full training run took just 2.788M H800 GPU hours, and the weights are open.
Gemma is a family of 2B and 7B open-weight models built from the research and technology behind Gemini. It beats similarly sized open models on 11 of 18 text tasks and ships with pretrained and instruction-tuned checkpoints.
Llama 2 shipped 7B, 13B, and 70B open-weight models plus Llama 2-Chat, the first open chat model whose RLHF pipeline — including a separate safety reward model and Ghost Attention — was documented in full.
Mellum 2 is JetBrains' open-weight 12B Mixture-of-Experts code model that activates only 2.5B parameters per token, matching dense 4B-14B baselines on software tasks at a fraction of the per-token compute.
Mistral 7B is a 7-billion-parameter open model that outperforms Llama 2 13B on every benchmark tested, uses grouped-query and sliding-window attention for cheap inference, and ships under Apache 2.0.
Mixtral 8x7B routes each token to 2 of 8 experts per layer, so it holds 47B parameters but uses only ~13B per token — and matches or beats Llama 2 70B and GPT-3.5 under Apache 2.0.
Vision-Language-Action · Allen Institute for AI
MolmoAct2 is an open vision-language-action stack that reasons in 3D before acting. On real-world DROID it hits 87.1% success, +38.7 points over the runner-up, and its Molmo2-ER brain beats GPT-5 and Gemini Robotics ER.
Multimodal Models · Sea AI Lab
OpenSearch-VL open-sources data, code, and weights for vision-language search agents that call real search, OCR, and image tools — its 30B-A3B model lifts seven benchmarks by 13.8 points on average over Qwen3-VL.
Open Models · Alibaba Qwen Team
Qwen2.5 is Alibaba's open-weight LLM family spanning 0.5B–72B, pretrained on 18T tokens; the 72B-Instruct flagship rivals Llama-3-405B-Instruct, a model roughly 5x larger.
Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K context window, trained on 15T+ tokens and openly published with weights.