Speech Synthesis

Text-to-speech and voice generation models, including zero-shot, expressive, and dialogue synthesis.

Speech Synthesis · Independent Researcher

A Broad Benchmark for Long-Form Speech Generation

This benchmark turns long-form speech generation into a checkable test, with concrete failure signals, stated limits, and takeaways for builders.

Speech Synthesis · Independent Researcher

MMAE: A Massive Benchmark for Audio Editing Models

MMAE turns audio editing evaluation into a checkable test, with concrete failure signals, stated limits, and takeaways for builders.

Speech Synthesis · Zhejiang University

SwanSphere: Streaming Spatial Audio Generation From Video and Text

SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21 s while cutting Fréchet Distance to 120.28 versus OmniAudio's 157.67. Quality without waiting for the whole clip.

Speech Synthesis · Microsoft Research

NaturalSpeech 2: Diffusion TTS Beyond Codec LMs

NaturalSpeech 2 uses latent diffusion over neural-audio-codec vectors and scales to 44K hours of speech and singing, aiming for stronger zero-shot prosody than token LMs.

Speech Synthesis · Microsoft Research

VALL-E: Zero-Shot Voice Cloning with Audio Tokens

VALL-E reframes TTS as codec-token language modeling: 60K hours of speech plus a 3-second prompt produce personalized zero-shot speech, but safety and release constraints matter.

Speech Synthesis · ByteDance

SwanVoice: Zero-Shot Speech Synthesis for Long Monologue and Dialogue

SwanVoice is a zero-shot TTS system that generates an entire conversation among one to four speakers in a single pass, keeping voice, mood, and prosody consistent across turns where turn-by-turn synthesis drifts, though content accuracy lags.