The Forgetting Wall in Video and World Models

1. Long video generation isn’t a compute problem

If you ask why we can't yet generate long, coherent video, the most common answer is some version of “we need more compute” — bigger models, longer training [2], more frames per second of generation. That answer isn't exactly wrong, but it points at the wrong wall.

For a while, the hard part of video generation was per-frame quality. That has largely been solved: individual frames now look great [3][4]. But a clip can score near the top on per-frame quality and still fail badly on world consistency; the two turn out to be largely uncorrelated. The frontier has moved to length and interactivity: generating minute-scale clips, and building world models you can steer and move around inside [5]. When you push on length, though, the failure you hit is not that the model runs out of compute. It's that the model forgets. The character who walked off-screen comes back wrong. The room quietly rearranges itself. Turn the camera away in a world model and turn it back, and the world has been reinvented.

Those aren't quality bugs. They're memory bugs. The Forgetting Wall for long-horizon video is the limit on how much of its own past the model can remember, and that turns out to be a surprisingly hard constraint.

2. What “memory” actually is here

Let's be concrete about what memory means for these models. An autoregressive video model generates frame by frame [6], and each new frame attends to what came before through the transformer's attention. To do that, it stores a key and a value for every past token (the KV cache). The KV cache is the model's memory of the video so far. If a detail isn't in the cache, the model has no way to be consistent with it.

The problem is how that cache scales. It grows linearly with the number of tokens generated — and video is brutally token-hungry. A single frame is worth thousands of tokens, so a clip that is “only” a few hundred frames long carries a KV cache equivalent to a text context of millions of tokens. This is the same long-context problem that has consumed LLM research for years, but video arrives at the wall much faster and hits it much harder. And not all of that history plays the same role: nearby frames carry the fine continuity of motion, while distant ones mostly sit there as context the model must occasionally recall — an asymmetry we'll come back to.
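To make the scaling concrete, here is a back-of-the-envelope sketch in Python. Every number in it (tokens per frame, layer count, KV heads, head dimension, 16-bit storage) is an illustrative assumption, not the configuration of any model cited in this post.

```python
def kv_cache_bytes(num_frames, tokens_per_frame, num_layers,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache one key and one value vector per token per layer."""
    num_tokens = num_frames * tokens_per_frame
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = key + value
    return num_tokens * values_per_token * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
config = dict(tokens_per_frame=1560, num_layers=30, num_kv_heads=12, head_dim=128)

for frames in (100, 400, 1600):
    size = kv_cache_bytes(frames, **config)
    tokens = frames * config["tokens_per_frame"]
    print(f"{frames:5d} frames -> {tokens:9,d} tokens -> {size / 2**30:7.1f} GiB")
```

The exact figures depend entirely on the assumed configuration. What carries over is the shape: the cache grows in direct proportion to the number of frames, with a large per-frame constant.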

**Figure 1: Every frame is worth thousands of tokens, and the cache keeps all of them.** (Panels: frame t=1, frame t=2.) Staying coherent from frame 1 to frame T means carrying the memory of everything in between, which is precisely what becomes too large to hold.

3. Why it’s hard, and what failure looks like

Here is the tension at the heart of the Forgetting Wall. Long-term consistency requires long-term memory: if a face leaves the frame and reappears one minute later, the model can only keep it consistent if its KV cache still holds what that face looked like one minute ago. But keeping all of that memory, at full precision, eventually overflows the hardware. So you're caught between two bad options:

Remember everything, and run out of memory. Or bound the memory, and start forgetting.

What makes this genuinely hard is a second problem layered on top: you don't know in advance which memories you'll need. A token from five seconds ago might be irrelevant, or it might be the only place the model recorded how the objects on a table were arranged. Any scheme that throws away “unimportant” history is making a bet about the future, and long-range dependencies are exactly the case where that bet is hardest to get right. It's very likely that the detail dropped to save memory turns out to be exactly the one the model needed to stay consistent later.

The abstract problem becomes obvious the moment you watch a model that has mismanaged its memory. The symptoms are consistent across systems:

Identity drift. Faces, clothing, and objects slowly morph over time. A car’s cabin can be one interior early in a drive and a different one a minute later, because the model no longer has a faithful record of the original.
Scene and layout inconsistency. A room rearranges itself; a door that was on the left reappears on the right. The spatial structure established early isn't reliably preserved.
World-model amnesia. In an interactive world model, you look away and look back and the scene has been regenerated from scratch — the canonical “turn around and the world resets” failure.

All three are the same underlying fault seen from different angles: the relevant past was not available, accurately, in the cache.

**Figure 2: Identity and scene drift.** Both frames, the car interior at 0:33 and at 2:40, come from one continuous video. By 2:40 the car’s entire interior (dashboard, instrument cluster, center screen, ambient lighting) has changed: the model has simply lost track of the vehicle it was rendering.

4. The landscape of partial answers

The difficulty isn't only that memory is expensive. It's that saving memory and staying consistent pull in opposite directions. Nobody has fully solved this, but it's worth seeing the whole map, because the proposals on the table are really different ways of confronting the dilemma from Section 3. They fall into several families.

4.1 Look at less of the past

The most direct idea is to store less of the past. Sliding-window and local attention only look back a fixed distance [6][7]: memory is bounded, but the model is blind to anything older than the window. Attention sinks and streaming methods keep a few anchor tokens alongside the window so generation stays stable as it slides, which is a fix for stability, not for long-range recall [8][9]. KV eviction drops the tokens a heuristic judges unimportant, while learned sparse routing tries to send each query to just the slice of history it needs [10]. They all rely on the same assumption: that whatever they drop or skip will not be needed later. The hard part is that there is no reliable way to tell, at the moment you drop a token, whether some much later frame will turn out to depend on it. So every fixed rule for what to discard is really a bet about a future the model cannot see, and over a long horizon long-range dependencies are exactly where that bet routinely fails.

Attention patterns: sliding window, attention sink, and KV eviction. Each row is a generation step; each column a past token in the KV cache.
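The first two patterns can be written down as boolean masks. The sketch below is a toy illustration in plain Python, not any cited system's implementation; the function names, the window size, and the number of sink tokens are all made up for the example, and KV eviction is left out because its mask depends on whichever scoring heuristic is used.

```python
def causal_mask(n):
    """Full causal attention: step i may read every cached token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Step i reads only the most recent `window` tokens (itself included)."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def sink_window_mask(n, window, num_sinks):
    """Sliding window plus the first `num_sinks` tokens kept as permanent anchors."""
    local = sliding_window_mask(n, window)
    return [[local[i][j] or (j < num_sinks and j <= i) for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask with '#' for visible tokens and '.' for hidden ones."""
    return "\n".join("".join("#" if keep else "." for keep in row) for row in mask)


n = 8
print(show(sliding_window_mask(n, window=3)), end="\n\n")
print(show(sink_window_mask(n, window=3, num_sinks=1)))

# Number of cached tokens visible at the last step under each pattern.
print(sum(causal_mask(n)[-1]),
      sum(sliding_window_mask(n, 3)[-1]),
      sum(sink_window_mask(n, 3, 1)[-1]))
```

In the windowed mask the last row can no longer see token 0 at all. The sink variant restores access to that one anchor, but everything between the anchor and the window is still gone.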

4.2 Summarize the past into a fixed state

A second family changes the math so memory stops growing at all. Linear attention and state-space models [11] fold the entire history into a fixed-size state, so memory and per-step cost stay constant no matter how long the video runs. Sana-Video [12] and Sana-WM [13] are concrete examples: they build their video generators on linear attention instead of the usual quadratic attention, so rather than caching a key and value for every past token, they keep a single fixed-size state and update it as each new frame arrives. FramePack [14] reaches a fixed budget differently: it compresses each input frame's context by importance, keeping recent frames detailed and packing older ones into fewer tokens, so the total context length stays bounded as the clip grows longer. It's an elegant ceiling on memory, but a fixed state is a lossy summary: the precise details that long-range consistency depends on are often the first thing a bounded summary washes out.
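A minimal NumPy sketch of the fixed-state idea, using generic causal linear attention with an elu(x) + 1 feature map. This is the textbook formulation under assumed toy dimensions, not the specific architecture of Sana-Video, Sana-WM, or Mamba.

```python
import numpy as np


def feature_map(x):
    """Positive feature map elu(x) + 1, a common choice in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_stream(queries, keys, values):
    """Causal linear attention computed with a fixed-size running state."""
    d_k, d_v = keys.shape[1], values.shape[1]
    state = np.zeros((d_k, d_v))   # running sum of phi(k) v^T
    norm = np.zeros(d_k)           # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        state += np.outer(phi_k, v)   # fold the new token into the state
        norm += phi_k
        phi_q = feature_map(q)
        outputs.append(phi_q @ state / (phi_q @ norm))
    return np.stack(outputs), state


rng = np.random.default_rng(0)
T, d = 512, 16
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, state = linear_attention_stream(q, k, v)
print(out.shape, state.shape)  # the state is (d, d) regardless of T
```

Nothing per token is kept after the update: each key and value is folded into `state` and `norm` and then discarded, which is where both the constant memory and the lossiness come from.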

4.3 Keep everything and store it for less

The third family refuses to throw anything away, and instead makes the cache cheaper to hold: KV-cache compression and quantization store every token, just at lower precision. This is the direction I find most promising for video, for one concrete reason: video's KV cache is enormously redundant. Adjacent frames look alike, and neighboring regions within a frame look alike, so there is far less independent information in the cache than its raw size suggests. The catch is that this redundancy is buried under a messy, irregular distribution, so it takes a carefully designed compression algorithm to expose that structure while maintaining high quality.

Block-wise KV quantization: 1, 3, 5 blocks. The whole causal history is kept; diagonal (current) blocks stay full precision while older blocks are quantized, and the KV cache strip grows with each block.
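For readers who have not seen low-bit quantization written out, here is a generic sketch of the quantize and dequantize round trip for one older block. It is plain asymmetric uniform quantization on random data, with an assumed group size of 64 and an assumed 16-bit scale plus 16-bit offset stored per group; it is not the scheme of any cited method, and the codes are left in `uint8` rather than bit-packed to keep the example short.

```python
import numpy as np


def quantize_groups(x, bits, group_size):
    """Asymmetric uniform quantization with one scale and offset per group."""
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo


def dequantize_groups(codes, scale, lo, shape):
    """Map integer codes back to approximate floating-point values."""
    return (codes * scale + lo).reshape(shape)


def effective_bits(bits, group_size, metadata_bits=32):
    """Bits per stored value once per-group metadata is counted."""
    return bits + metadata_bits / group_size


rng = np.random.default_rng(0)
kv_block = rng.standard_normal((256, 128)).astype(np.float32)  # toy stand-in for cached keys

for bits in (8, 4, 2):
    codes, scale, lo = quantize_groups(kv_block, bits, group_size=64)
    restored = dequantize_groups(codes, scale, lo, kv_block.shape)
    err = np.abs(restored - kv_block).mean()
    ratio = 16 / effective_bits(bits, group_size=64)
    print(f"{bits}-bit: mean abs error {err:.4f}, about {ratio:.1f}x smaller than 16-bit")
```

Two things show up even in this toy setting: the error climbs quickly as the bit-width drops, and the per-group metadata keeps a 2-bit cache somewhat short of the ideal 8x saving over 16-bit storage.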

And you can't simply borrow the compressors built for text [15]. They were designed for one-dimensional token streams, not for video's spatiotemporal structure, so even a strong LLM KV-cache quantizer can degrade noticeably on video (Figure 4). The opportunity is real, but it has to be taken on video's own terms.

One attempt: QuantVideoGen (QVG) [1]

QVG is our take on this third family: a KV-cache quantization method built around video's structure rather than borrowed from text. It first subtracts what neighboring tokens share (Semantic-Aware Smoothing) and then refines the leftover residual coarse-to-fine (Progressive Residual Quantization), which lets a pretrained model run on a 2-bit KV cache — up to 7× less memory at under 4% latency overhead, with quality essentially intact. It's training-free and open-source; details are in the paper and code.

**Figure 3. QVG exposes the redundancy before spending it.** Three views of the key cache: original, after Semantic-Based Grouping, and after Centroid Subtraction.
**(left)** The original key cache is irregular — values swing widely across channels, so quantizing it directly loses quality.
**(middle)** Semantic-Based Grouping makes it regular, but still hard to quantize.
**(right)** Subtracting each group's centroid removes the shared component, leaving a small, uniform residual that quantizes cleanly at low bit-width — which is what lets QVG run on a 2-bit cache.
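The effect in the right panel can be reproduced on synthetic data. The sketch below is a toy illustration of why subtracting a shared component helps low-bit quantization; it is not QVG's implementation. The group labels are simply given rather than found by any grouping step, the data is synthetic by construction, the quantizer is a plain per-row min/max scheme, and the coarse-to-fine residual refinement is not reproduced.

```python
import numpy as np


def quantize_roundtrip(x, bits=2):
    """Quantize each row to `bits` with its own min/max range, then dequantize."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels
    return np.round((x - lo) / scale) * scale + lo


rng = np.random.default_rng(0)
num_groups, tokens_per_group, dim = 8, 64, 64

# Toy keys: tokens in a group share a large common component plus small variation.
centroids_true = rng.normal(0.0, 5.0, size=(num_groups, dim))
labels = np.repeat(np.arange(num_groups), tokens_per_group)
keys = centroids_true[labels] + rng.normal(0.0, 0.1, size=(labels.size, dim))

# Baseline: quantize the raw keys directly.
direct = quantize_roundtrip(keys)

# Subtract each group's mean, quantize only the residual, then add the mean back.
centroids = np.stack([keys[labels == g].mean(axis=0) for g in range(num_groups)])
residual = keys - centroids[labels]
centered = quantize_roundtrip(residual) + centroids[labels]

err_direct = np.abs(direct - keys).mean()
err_centered = np.abs(centered - keys).mean()
print(f"direct 2-bit error:   {err_direct:.4f}")
print(f"centered 2-bit error: {err_centered:.4f}")
```

The centroids themselves are kept at full precision here, one vector per group, so their cost is amortized over every token in the group. The four quantization levels are then spent on the small residual instead of on the wide range of the shared component.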

**Figure 4. Same goal, different outcomes on HY-WorldPlay.** Panels: uncompressed memory; KIVI (built for text); QVG (ours, 2-bit). KIVI, a quantizer built for text, degrades; QVG, designed around video's redundancy, holds the quality of the uncompressed baseline while running on a 2-bit cache.

4.4 Other directions

Two other directions are worth noting. One is to shrink the cache by adopting a more aggressive video VAE that encodes each frame into far fewer latent tokens. Deep-compression autoencoders such as DC-AE [16] reach compression ratios about 4× higher than earlier VAEs, and since the KV cache size scales inversely with the compression ratio, that 4× cuts the cache to roughly a quarter [17]. The other is to relocate the cache: keep the full KV history in abundant CPU (host) memory and stream each slice back to the GPU when attention needs it. Nothing is forgotten or approximated, but moving the cache across the PCIe bus on every step is slow, so for real-time, interactive generation the latency is usually prohibitive.
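Both trade-offs reduce to simple arithmetic. In the sketch below the baseline cache size and the host-to-GPU bandwidth are assumed round numbers picked for illustration, not measurements, and the offloading estimate takes the worst case in which a generation step needs the whole history streamed back.

```python
def cache_after_compression(cache_gib, extra_compression):
    """Fewer latent tokens per frame shrink the cache by the same factor."""
    return cache_gib / extra_compression


def transfer_seconds(cache_gib, bandwidth_gib_per_s):
    """Time to move the given amount of cache across the host-to-GPU link once."""
    return cache_gib / bandwidth_gib_per_s


baseline_gib = 40.0     # hypothetical full-precision cache for a long clip
link_gib_per_s = 25.0   # assumed effective host-to-GPU bandwidth

print(f"4x more compressed latents: {cache_after_compression(baseline_gib, 4):.1f} GiB")

# Worst case: every generation step streams the whole cache back from host memory.
step_s = transfer_seconds(baseline_gib, link_gib_per_s)
print(f"full-cache transfer: {step_s:.2f} s per step, "
      f"versus a {1 / 24:.3f} s budget for one frame at 24 fps")
```

Under these assumptions the transfer alone takes far longer than a single frame's time budget, which is the gap that makes naive offloading impractical for interactive use.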

5. Why this is the constraint to watch

As video models become world models (interactive, long-lived, and something an agent can act inside), memory becomes the defining constraint, playing the same role for world models that context length plays for LLMs. How much of the past a model can afford to remember sets a hard ceiling on how long, coherent, and controllable its world can be.

The answer is probably not a single trick. Recent frames need high-resolution local memory. Distant history may need to be compressed, summarized, or retrieved only when it matters. Spatial structure may eventually live in a more explicit world representation. In other words, future video models will likely need a memory hierarchy, not just a longer cache.

But before we get there, one thing seems clear: forgetting is too damaging. If long video is to stay consistent, the model needs a way to keep far more of its past than today's systems can afford.

References

[1] Xi, Haocheng, et al. “Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization.” arXiv preprint, 2026. arXiv:2602.02958.
[2] Chen, Yukang, et al. “LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation.” arXiv preprint, 2026. arXiv:2605.18739.
[3] Wan Team. “Wan: Open and Advanced Large-Scale Video Generative Models.” arXiv preprint, 2025. arXiv:2503.20314.
[4] Seedance Team. “Seedance 2.0: Advancing Video Generation for World Complexity.” arXiv preprint, 2026. arXiv:2604.14148.
[5] Bruce, Jake, et al. “Genie: Generative Interactive Environments.” ICML, 2024. arXiv:2402.15391.
[6] Huang, Xun, et al. “Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion.” NeurIPS, 2025. arXiv:2506.08009.
[7] Yang, Shuai, et al. “LongLive: Real-time Interactive Long Video Generation.” arXiv preprint, 2025. arXiv:2509.22622.
[8] Xiao, Guangxuan, et al. “Efficient Streaming Language Models with Attention Sinks.” ICLR, 2024. arXiv:2309.17453.
[9] Liu, Kunhao, et al. “Rolling Forcing: Autoregressive Long Video Diffusion in Real Time.” arXiv preprint, 2025. arXiv:2509.25161.
[10] Cai, Shengqu, et al. “Mixture of Contexts for Long Video Generation.” arXiv preprint, 2025. arXiv:2508.21058.
[11] Gu, Albert, and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv preprint, 2023. arXiv:2312.00752.
[12] Chen, Junsong, et al. “SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer.” arXiv preprint, 2025. arXiv:2509.24695.
[13] Zhu, Haoyi, et al. “SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer.” arXiv preprint, 2026. arXiv:2605.15178.
[14] Zhang, Lvmin, et al. “Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models.” arXiv preprint, 2025. arXiv:2504.12626.
[15] Liu, Zirui, et al. “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.” ICML, 2024. arXiv:2402.02750.
[16] Chen, Junyu, et al. “Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models.” arXiv preprint, 2024. arXiv:2410.10733.
[17] He, Wenkun, et al. “DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space.” arXiv preprint, 2025. arXiv:2509.25180.

Citation

If you find this post helpful, please cite:

@article{xi2026quant,
  title   = {Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization},
  author  = {Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Li, Muyang and Cai, Han and Li, Xingyang and Lin, Yujun and Zhang, Zhuoyang and Zhang, Jintao and Li, Xiuyu and others},
  journal = {arXiv preprint arXiv:2602.02958},
  year    = {2026}
}

@misc{xi2026memory,
  title  = {The Forgetting Wall in Video and World Models},
  author = {Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Li, Muyang and Cai, Han and Li, Xingyang and Lin, Yujun and Zhang, Zhuoyang and Zhang, Jintao and Li, Xiuyu and others},
  year   = {2026},
  month  = {June},
  url    = {https://haochengxi.github.io/posts/forgetting-wall/},
  note   = {Blog post}
}