BlockVid represents a major leap forward in long-video generation, tackling one of the hardest open problems in the field: producing coherent, high-fidelity, minute-long clips without collapse, drift, or degradation over time. Developed by DAMO Academy, ZIP Lab, and Hupan Lab, BlockVid enhances the semi-autoregressive block diffusion paradigm with innovations that directly address KV-cache error accumulation, enabling stable, realistic, and compelling long-horizon video synthesis.
Key Highlights
Semantic-Aware Sparse KV Cache: A novel KV-cache mechanism selectively stores only the most meaningful tokens, retrieving semantically aligned historical context instead of blindly accumulating past errors. This dramatically reduces long-horizon drift and preserves subject/background consistency.
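The paper's exact cache mechanism is not reproduced here, but the core idea, keep only the historical key/value entries most semantically aligned with the current query, can be sketched in a few lines of numpy. The function name `select_sparse_kv` and the `top_k` parameter are hypothetical illustrations, not BlockVid's API:

```python
import numpy as np

def select_sparse_kv(cached_keys, cached_values, query, top_k=4):
    """Keep only the top_k cached entries most semantically aligned with the
    current query, instead of retaining the full (possibly drifted) history."""
    # Cosine similarity between the query and each cached key.
    k_norm = cached_keys / np.linalg.norm(cached_keys, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = k_norm @ q_norm
    # Indices of the most semantically relevant historical tokens.
    keep = np.argsort(scores)[-top_k:]
    return cached_keys[keep], cached_values[keep]

# Toy usage: 16 cached tokens with 8-dim embeddings.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
q = rng.normal(size=8)
Ks, Vs = select_sparse_kv(K, V, q, top_k=4)
print(Ks.shape, Vs.shape)  # (4, 8) (4, 8)
```

Because the cache size stays bounded and entries are chosen by relevance rather than recency, stale or error-contaminated context is naturally evicted as generation proceeds.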
Block Forcing + Self Forcing Training Strategy: A new training recipe combines Block Forcing (for chunk-to-chunk semantic alignment) with Self Forcing (closing the train–test gap), preventing models from drifting, morphing identities, or losing scene structure as the video grows longer.
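The training recipe itself is specific to the paper, but the Self Forcing idea, closing the train–test gap by conditioning each chunk on the model's own previous output rather than ground truth, resembles scheduled sampling and can be illustrated schematically. Everything below (`toy_model`, `self_forcing_rollout`, the alignment loss) is a hypothetical stand-in, not BlockVid's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(prev_chunk, noise):
    # Stand-in denoiser: blends the conditioning chunk with injected noise.
    return 0.9 * prev_chunk + 0.1 * noise

def self_forcing_rollout(gt_chunks, noise_scale=0.1):
    """Schematic self forcing: condition each chunk on the model's OWN previous
    output (as at inference time) instead of the ground-truth chunk, so the
    training distribution matches the test-time distribution."""
    outputs = [gt_chunks[0]]
    for t in range(1, len(gt_chunks)):
        noise = noise_scale * rng.normal(size=gt_chunks[t].shape)
        outputs.append(toy_model(outputs[-1], noise))  # own output, not gt
    # Block-forcing-style alignment: penalize chunk-level mismatch with the
    # ground-truth chunks to keep semantics from drifting.
    align_loss = float(np.mean([np.mean((o - g) ** 2)
                                for o, g in zip(outputs[1:], gt_chunks[1:])]))
    return outputs, align_loss

gt = [np.full(4, float(i)) for i in range(4)]
outputs, loss = self_forcing_rollout(gt)
print(len(outputs), loss >= 0.0)
```

The key design point is that errors the model makes on chunk t are visible when generating chunk t+1 during training, so the model learns to correct rather than compound them.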
Chunk-Level Noise Scheduling & Shuffling: Noise progressively increases across chunks and is locally shuffled at boundaries, smoothing transitions and minimizing abrupt artifacts that commonly plague minute-long generations.
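A minimal sketch of what such a schedule could look like, assuming a linear per-chunk noise increase and a small shuffled window straddling each chunk boundary (the function name and parameters are illustrative, not the paper's):

```python
import numpy as np

def chunk_noise_schedule(num_chunks, frames_per_chunk,
                         base=0.1, step=0.05, boundary=2, seed=0):
    """Per-frame noise levels that increase chunk by chunk, with the frames
    around each chunk boundary locally shuffled to soften the transition."""
    rng = np.random.default_rng(seed)
    levels = np.concatenate([
        np.full(frames_per_chunk, base + i * step) for i in range(num_chunks)
    ])
    # Locally shuffle a small window straddling each boundary.
    for i in range(1, num_chunks):
        b = i * frames_per_chunk
        window = slice(b - boundary, b + boundary)
        seg = levels[window].copy()
        rng.shuffle(seg)
        levels[window] = seg
    return levels

sched = chunk_noise_schedule(num_chunks=4, frames_per_chunk=6)
print(sched.round(2))
```

Shuffling only at the boundaries preserves the overall increasing trend while removing the hard step change that would otherwise show up as an abrupt visual artifact between chunks.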
LV-Bench: Fine-Grained Long-Video Benchmark: The authors introduce LV-Bench, a dataset of 1,000 minute-long videos with dense 2–5s annotations, plus a new Video Drift Error (VDE) metric suite to quantify long-range temporal consistency (clarity, subject identity, motion, aesthetics, background stability).
State-of-the-Art Long-Video Coherence: BlockVid significantly outperforms leading baselines like MAGI-1, SkyReels-V2, FramePack, and Self Forcing:
- 22.2% improvement on VDE-Subject
- 19.4% improvement on VDE-Clarity
- Top scores in subject consistency, background stability, motion smoothness, and image quality on LV-Bench and VBench.
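The exact VDE formulation belongs to LV-Bench, but a generic drift measure in the same spirit, how far each chunk's features wander from the opening chunk, is easy to sketch. The function `drift_error` below is a hypothetical simplification, not the benchmark's metric:

```python
import numpy as np

def drift_error(chunk_features):
    """Generic drift score: cosine similarity of each chunk's feature vector
    to the first chunk's; drift = 1 - mean similarity over later chunks."""
    ref = chunk_features[0] / np.linalg.norm(chunk_features[0])
    sims = [float((f / np.linalg.norm(f)) @ ref) for f in chunk_features[1:]]
    return 1.0 - float(np.mean(sims))

# A stable video (features barely change direction) scores near-zero drift;
# a drifting one accumulates error chunk by chunk.
stable = [np.ones(8) + 0.01 * i for i in range(6)]
drifting = [np.ones(8) + 0.5 * i * np.arange(8) for i in range(6)]
print(drift_error(stable), drift_error(drifting))
```

Plugging in per-chunk embeddings of subject identity, clarity, or background would give per-dimension drift curves analogous to the VDE suite's categories.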
High-Quality Minute-Long Video Generation
BlockVid maintains sharpness, color stability, and structural coherence across 60-second clips, where many models collapse after ~10–20 seconds.
Why It Matters
BlockVid demonstrates a critical breakthrough: scaling diffusion-based video generation to realistic minute-long horizons with stability, fidelity, and semantic coherence.
This unlocks new potential for:
- Long-form storytelling & filmmaking
- World models & simulation
- Virtual production & advertising
- Embodied AI training environments
- High-resolution cinematic generation
- Real-time creative workflows
BlockVid’s semi-autoregressive design suggests that the future of video AI lies in chunkwise generation with intelligent memory, coherent dynamics, and architecture-level drift control.
Explore More
- Paper: arXiv:2511.22973v1
- Project Page: https://ziplab.co/BlockVid
Related LearnOpenCV Posts:
- FramePack: https://learnopencv.com/framepack-video-diffusion/
- Stable Diffusion: https://learnopencv.com/stable-diffusion-3/



