Institution

Independent Researcher

Independent researchers publishing work without a listed institutional affiliation.

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · Independent Researcher

AnchorWorld: Egocentric World Simulation for Embodied AI

AnchorWorld turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Brain Decoding · Independent Researcher

Brain-Diffuser: Natural Scene Reconstruction from fMRI

Brain-Diffuser turns natural scene reconstruction from fMRI signals into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Diffusion Language Models · Independent Researcher

Diffusion Language Modeling: Promises and Challenges

This survey turns the state of diffusion language modeling into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Text-to-Image · Independent Researcher

DIRECT: 3D-Aware Object Insertion with Visual Proxies

DIRECT turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Brain Decoding · Independent Researcher

DreamDiffusion: EEG-to-Image Generation with Diffusion

DreamDiffusion turns EEG-to-image generation into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Biomolecular Modeling · Independent Researcher

DynamicMPNN: Multi-State Protein Design with Inverse Folding

DynamicMPNN turns multi-state protein sequence design into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Diffusion Language Models · Independent Researcher

Factorization-Error-Free Decoding for Diffusion LMs

Factorization-error-free decoding turns speculative decoding for discrete diffusion LMs into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Biomolecular Modeling · Independent Researcher

Feynman-Kac Steering for Controllable Protein Design

Feynman-Kac steering turns controllable protein design with guided diffusion into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

World Models · Independent Researcher

Function2Scene: 3D Indoor Layout from Functional Specs

Function2Scene turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp turns Korean-context web browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Language Models · Independent Researcher

WASH: Averaging 3 LLMs Erases Text Watermarks

Averaging the output distributions of 3 independent LLMs collapses watermark detection z-scores from a range of 5 to 300 down to below 2, and the WASH paper proves why it works with an O(1/sqrt(N)) error bound.

Speech Synthesis · Independent Researcher

A Broad Benchmark for Long-Form Speech Generation

This benchmark turns long-form speech generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

When Masking Stale Observations Helps Search Agents

This work turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Brain Decoding · Independent Researcher

MinD-Vis: fMRI Vision Decoding with Latent Diffusion

MinD-Vis turns fMRI-to-image reconstruction with latent diffusion into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Theorem Proving · Independent Researcher

MiniF2F: Formal Olympiad Mathematics Benchmark

MiniF2F turns formal Olympiad-level mathematics benchmarking into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Speech Synthesis · Independent Researcher

MMAE: A Massive Benchmark for Audio Editing Models

MMAE turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Biomolecular Modeling · Independent Researcher

ProGen2: Protein Language Models for Protein Design

ProGen2 turns protein sequence modeling and design into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Diffusion Language Models · Independent Researcher

SEDD: Discrete Diffusion Language Modeling by Ratios

SEDD turns discrete diffusion language modeling into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Text Embeddings · Independent Researcher

Sentence-BERT: Sentence Embeddings with Siamese BERT

Sentence-BERT turns sentence embeddings for semantic similarity into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES turns proactive mediation agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

TASTE: Harder Agent Benchmarks from Tool Sequences

TASTE turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Small Language Models · Independent Researcher

TinyLlama: An Open Small Language Model Recipe

TinyLlama turns open small language model training into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

AI Agents · Independent Researcher

TIDE: Proactive Multi-Problem Discovery with Templates

TIDE turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Robotics · Independent Researcher

TVRBench: Can Models Move to a Target Viewpoint?

TVRBench turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Segmentation · Independent Researcher

U-Net: The Biomedical Image Segmentation Baseline

U-Net turns biomedical image segmentation into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Multimodal Models · Independent Researcher

VideoKR: Knowledge-Intensive Video Understanding

VideoKR turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · University of Macau

PF-OPSD: When Should an MLLM Trust a World Model's Video?

PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy by 10.6 and 10.9 points on two new QA benchmarks.

Diffusion Models · Independent Researcher

Mean Mode Screaming: Stabilizing 1000-Layer Diffusion Transformers

Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.