Alignment

Methods for steering models toward preferred, safer, or more useful behavior.

From AGI to ASI: DeepMind's Map of Superintelligence Pathways

Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.

AI Agents · Independent Researcher

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE turns the reliability of role-playing language agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Language Models · Independent Researcher

WASH: Averaging 3 LLMs Erases Text Watermarks

Averaging the output distributions of 3 independent LLMs collapses watermark detection z-scores from the 5-300 range to below 2, and the WASH paper proves why it works with an O(1/sqrt(N)) error bound.

Reinforcement Learning · Tsinghua University

CHERRL: A Controlled Sandbox for Reward Hacking in Rubric RL

CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with a total interval error of 11 steps and missed none of six runs.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES turns proactive LLM mediation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

TASTE: Harder Agent Benchmarks from Tool Sequences

TASTE turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai AI Laboratory

AgentDoG 1.5: A Lightweight Guardrail for AI Agent Safety

AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.

Alignment · Anthropic

Constitutional AI: Training a Harmless Assistant from AI Feedback

Constitutional AI trains a harmless assistant with almost no human harm labels — a model critiques and revises its own answers against a written list of principles, then learns from AI-generated preferences (RLAIF).

Alignment · OpenAI

InstructGPT: How RLHF Beat a Model 100x Its Size

OpenAI's InstructGPT used human feedback to align GPT-3, and evaluators preferred outputs from its 1.3B model over those of the 175B GPT-3, despite the smaller model having over 100x fewer parameters.

Alignment · OpenAI

PPO Explained: The Clipped Objective Behind RLHF

PPO keeps policy-gradient RL stable with a clipped surrogate objective — almost as well-behaved as TRPO but far simpler — which made it the default RL engine behind RLHF for ChatGPT and InstructGPT.

Alignment · Seoul National University

Why Personality Tests Mischaracterize LLM Behavior

Giving an LLM the Big Five or a values survey predicts almost nothing about how it acts in real queries: cross-method agreement was only Spearman 0.31 (values) and 0.26 (personality), versus 0.74-0.77 within-survey.

Multimodal Models · University of California, Davis

When Vision Speaks for Sound: The Audio-Visual Clever Hans Effect

Top video models look like they hear audio but really guess it from the picture. This paper's THUD probes catch the cheat, and a 10K-sample fix lifts audio grounding by 28 points.

Alignment · Stanford University

DPO Explained: Aligning LLMs Without the RLHF Reward Model

Direct Preference Optimization solves the RLHF problem with a single classification-style loss on preference pairs — no separate reward model, no RL loop, no sampling during training.