MIGA: Train-Free Infinite-Frame Generation for Consistent Long Videos
MIGA turns a fixed-length video diffusion model into a 1000+-frame generator with no training and constant memory, hitting 97.82 overall on VBench with VideoCrafter2 — about 2.8 points over FIFO-Diffusion.
Quick answer
MIGA generates videos of 1000+ frames from a video diffusion model that was only ever trained on short clips — without any fine-tuning and with constant memory. On VBench using VideoCrafter2 as the base, MIGA scores 97.82 overall, 97.66 subject consistency, and 96.99 background consistency, beating the training-free baseline FIFO-Diffusion by roughly 2.8 points overall (+4.7% subject consistency, +2.0% background consistency). The whole point is that you keep your existing short-video model and bolt MIGA onto the sampler.
The mismatch that breaks long videos
Sliding-window samplers like FIFO-Diffusion let you stream out arbitrarily long videos by denoising a queue of frames at staggered noise levels. The problem MIGA targets is the train-inference mismatch: the base model was trained to denoise frames that all sit at the same noise level, but the sliding window feeds it frames at different noise levels at once. The model has never seen that input distribution, so quality degrades and the video drifts. Existing fixes either retrain (expensive, defeats the purpose) or paper over the symptoms.
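The mismatch is easier to see with noise levels written out. The sketch below is a minimal illustration that tracks integer noise levels only and involves no model. The names `simulate_queue` and `uniform_levels` and the queue length of 8 are assumptions made for this example, not values from the paper.

```python
from collections import deque

NUM_STEPS = 8  # hypothetical: denoising steps, also the queue length


def uniform_levels(window, t):
    """Training-time input: every frame in the clip shares noise level t."""
    return [t] * window


def simulate_queue(num_output_frames, num_steps=NUM_STEPS):
    """Track (frame_id, noise_level) pairs in a sliding-window queue.

    Level 0 means clean, num_steps means pure noise. Each pass lowers every
    frame by one level, emits the clean head frame, and appends a fresh
    pure-noise frame at the tail.
    """
    queue = deque((i, i + 1) for i in range(num_steps))
    next_id = num_steps
    emitted, spreads = [], []
    while len(emitted) < num_output_frames:
        levels = [lvl for _, lvl in queue]
        # Gap between the noisiest and cleanest frame the model sees at once.
        spreads.append(max(levels) - min(levels))
        queue = deque((fid, lvl - 1) for fid, lvl in queue)
        fid, lvl = queue.popleft()
        assert lvl == 0  # the head frame is fully denoised
        emitted.append(fid)
        queue.append((next_id, num_steps))
        next_id += 1
    return emitted, spreads


emitted, spreads = simulate_queue(5)
print(emitted)   # [0, 1, 2, 3, 4]
print(spreads)   # [7, 7, 7, 7, 7]
print(max(uniform_levels(8, 5)) - min(uniform_levels(8, 5)))  # 0
```

The queue never grows, which is where constant memory comes from, but every pass shows the model a spread of noise levels that a uniform-level training batch never contains.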
How MIGA works
MIGA stacks two training-free mechanisms on the sampler.
Two-stage Training-inference Alignment (TTA). Stage one runs zigzag iterative denoising over a narrow window (the reported optimal zigzag width is 4 frames) to gently bring frames toward a shared noise level. Stage two then applies unified noise-level denoising, so the input the base model sees matches what it was trained on. This closes the distribution gap directly rather than hiding it.
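The toy below shows only the target state of the two stages: a window that starts at staggered noise levels, gets equalized, and is then denoised at one shared level. It is not the paper's zigzag procedure, whose exact schedule this summary does not specify. The function name `equalize_then_denoise` and the rule of stepping down the noisiest frames first are assumptions made for illustration.

```python
def equalize_then_denoise(levels):
    """Toy bookkeeping on integer noise levels (0 = clean).

    Stage 1 (assumed rule): step the noisiest frames down one level at a
    time until every frame sits at the window's minimum level.
    Stage 2: lower all frames together, one shared level per pass.
    Returns one snapshot of the window's levels per pass.
    """
    levels = list(levels)
    history = [tuple(levels)]
    target = min(levels)

    # Stage 1: close the gap between frames.
    while max(levels) > target:
        top = max(levels)
        levels = [lvl - 1 if lvl == top else lvl for lvl in levels]
        history.append(tuple(levels))

    # Stage 2: every frame now shares one level, as in training.
    while levels[0] > 0:
        levels = [lvl - 1 for lvl in levels]
        history.append(tuple(levels))

    return history


# A width-4 window with staggered levels, matching the reported zigzag width.
for snapshot in equalize_then_denoise([3, 4, 5, 6]):
    print(snapshot)
```

From the first all-equal snapshot onward, the input has the same form as the uniform-noise clips the base model was trained on.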
Dual Consistency Enhancement (DCE). Two pieces fight long-range drift. A self-reflection step lets the model re-examine and correct its own predictions during sampling, and long-range frame guidance (the optimal setting uses 6 guidance frames) anchors new frames to earlier content so the subject and background stay stable across hundreds of frames instead of slowly morphing.
Neither component touches the weights — MIGA is a sampler-side intervention, which is why it transfers across base models.
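Because everything lives in the sampler, the method's knobs can be collected in a small configuration object. This is a hypothetical sketch: `MigaSamplerConfig`, `config_for_base`, and the field names are invented for this example and are not taken from any released code. The defaults reuse the settings quoted in this article (zigzag width 4, 6 guidance frames, 16 latents for VideoCrafter2, 21 for Wan2.1-1.3B). Whether the same zigzag width and guidance count are optimal for both bases is an assumption.

```python
from dataclasses import dataclass

# Window sizes quoted in this article for the two base models.
BASE_WINDOW_LATENTS = {"VideoCrafter2": 16, "Wan2.1-1.3B": 21}


@dataclass(frozen=True)
class MigaSamplerConfig:
    """Hypothetical container for sampler-side settings; no weights involved."""

    zigzag_width: int = 4           # TTA stage one: frames per zigzag window
    num_guidance_frames: int = 6    # DCE: earlier frames used as anchors
    use_self_reflection: bool = True
    window_latents: int = 16        # latents the base model denoises at once

    def __post_init__(self):
        if self.zigzag_width < 1 or self.num_guidance_frames < 0:
            raise ValueError("widths and counts must be non-negative, width >= 1")
        if self.zigzag_width > self.window_latents:
            raise ValueError("zigzag window cannot exceed the model window")


def config_for_base(base_name):
    """Return the default config with the window size for a known base model."""
    return MigaSamplerConfig(window_latents=BASE_WINDOW_LATENTS[base_name])


print(config_for_base("Wan2.1-1.3B"))
```

Switching base models changes only the window size here, which mirrors the claim that nothing model-specific has to be trained.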
Key results
- VBench overall (VideoCrafter2, 128 frames): 97.82 for MIGA vs 95.02 for FIFO-Diffusion and 96.95 for FreeLong — a state-of-the-art training-free result.
- Consistency gains: 97.66 subject consistency and 96.99 background consistency, reported as +4.7% subject and +2.0% background over FIFO-Diffusion.
- Scale: 1000+ frames with constant memory consumption — memory does not grow with video length, which is what makes “infinite-frame” practical.
- NarrLV (Wan2.1-1.3B base, TNA=2): 79.32 scene attributes, 67.87 target attributes, 67.94 target actions — evidence it holds up on narrative, multi-event prompts, not just looped scenes.
- Ablations: TTA alone contributes 2.03% overall and DCE alone contributes 1.73%, so both halves carry weight rather than one doing all the work.
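The headline margins follow directly from the overall VBench scores listed above. The short check below uses only those three numbers. The helper names `margin` and `relative_gain_pct` are made up for this example, and the relative figure is a derived quantity, not one reported in the paper.

```python
# Overall VBench scores quoted above (VideoCrafter2 base, 128 frames).
scores = {"MIGA": 97.82, "FIFO-Diffusion": 95.02, "FreeLong": 96.95}


def margin(a, b):
    """Absolute gap in VBench points between two methods."""
    return round(scores[a] - scores[b], 2)


def relative_gain_pct(a, b):
    """Gap expressed as a percentage of the baseline's score."""
    return round(100 * (scores[a] - scores[b]) / scores[b], 2)


print(margin("MIGA", "FIFO-Diffusion"))             # 2.8
print(margin("MIGA", "FreeLong"))                   # 0.87
print(relative_gain_pct("MIGA", "FIFO-Diffusion"))  # 2.95
```

The gap to FIFO-Diffusion is the "roughly 2.8 points" figure, while the gap to the stronger FreeLong baseline is under one point.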
Why this is worth attention
The honest pitch is efficiency, not raw fidelity. MIGA does not train a better video model; it makes the model you already have stream much longer without the usual collapse, and it does so on two different bases (VideoCrafter2 at 16 latents, Wan2.1-1.3B at 21 latents). For anyone who cannot afford to train a long-video model from scratch, a training-free sampler upgrade that adds about 2.8 VBench points over FIFO-Diffusion (and just under one point over FreeLong) while removing the memory ceiling is a practical win. It was accepted at ICML 2026, which is a reasonable signal the alignment argument held up under review.
Limits and open questions
The gains are real but modest in absolute terms. VBench scores were already in the mid-to-high 90s, so 97.82 vs 96.95 overall is a narrow margin, and VBench rewards smoothness and consistency more than it punishes a video that is consistent but boring. The NarrLV target-attribute and target-action scores sit in the high 60s, which is far from solved and suggests long narrative videos still lose semantic fidelity even when they look stable. MIGA inherits every weakness of its base model: it cannot generate content the base cannot, and a weak short-clip model will still produce a weak long video. The method also adds sampler-side compute per frame (zigzag denoising, self-reflection, guidance frames), so “constant memory” does not mean “free”; wall-clock cost per frame is not among the headline numbers and deserves scrutiny before deployment.
FAQ
What is MIGA in video generation?
MIGA is a training-free method that lets a standard video diffusion model generate 1000+-frame videos with constant memory. It adds two sampler-side mechanisms — two-stage training-inference alignment and dual consistency enhancement — without changing any model weights.
How does MIGA differ from FIFO-Diffusion?
Both are training-free sliding-window samplers, but MIGA explicitly fixes the train-inference noise-level mismatch that FIFO-Diffusion suffers from. On VBench with VideoCrafter2 it scores 97.82 overall vs FIFO-Diffusion’s 95.02, with +4.7% subject and +2.0% background consistency.
Does MIGA need any fine-tuning or retraining?
No. MIGA is entirely a sampling-time intervention and touches no weights, which is why it transfers across base models like VideoCrafter2 and Wan2.1-1.3B without per-model training.
What benchmarks does MIGA report?
MIGA reports VBench (97.82 overall on VideoCrafter2) for general quality and NarrLV (scene 79.32, target attributes 67.87, target actions 67.94 on Wan2.1-1.3B) for narrative, multi-event long videos.
Is MIGA actually better video quality or just longer?
Mainly longer and more stable, not dramatically higher fidelity. The VBench gain over FreeLong is under one point, and narrative-attribute scores stay in the high 60s — MIGA’s contribution is reaching long lengths without the usual quality collapse, on a model you did not have to retrain.
One line: keep your short-video model, align the noise levels and anchor the frames, and it streams to 1000+ frames without retraining. Read the original paper on arXiv.