Experiences

NVIDIA

May 2025 - Aug 2025

San Jose, CA, USA

Research internship at NVIDIA, advised by Prof. Song Han (MIT Han Lab).

Research Intern

May 2025 - Aug 2025

Responsibilities:

Conducted research on efficient training and inference for large generative models, including video diffusion and low-precision (FP8) workflows.
Collaborated with MIT Han Lab and NVIDIA researchers on systems for scalable model deployment.

NVIDIA

Feb 2024 - Aug 2024

Beijing, China

Conducted research on FP8 training for large language models, advised by Prof. Song Han.

Research Intern

Feb 2024 - Aug 2024

Responsibilities:

Proposed COAT, a memory-efficient FP8 training method that compresses optimizer states and activations.
Published a first-author paper on COAT, accepted to ICLR 2025. The code is open-sourced.
Participated in the NVILA project, responsible for FP8 training of vision language models.

Selected Publications

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

ICML 2026 Apr 2026

Haocheng Xi Harman Singh Yuezhou Hu Coleman Hooper Rishabh Tiwari Aditya Tomar Minjae Lee Wonjun Kang Michael W. Mahoney Chenfeng Xu Kurt Keutzer Amir Gholami

We propose LoSA, a locality-aware sparse attention for block-wise diffusion language models. LoSA reuses cached prefix-attention for stable tokens and applies sparse attention only to active tokens, achieving up to +9 accuracy points at aggressive sparsity, 1.54x lower attention density, and 4.14x attention speedup.

sparse attention diffusion language models

Details

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Arxiv 2026 Mar 2026

Shuo Yang* Haocheng Xi* Yilong Zhao Muyang Li Xiaoze Fan Jintao Zhang Han Cai Yujun Lin Xiuyu Li Kurt Keutzer Song Han Chenfeng Xu Ion Stoica

We propose Flash-KMeans, an IO-aware and contention-free k-means for modern GPUs. FlashAssign fuses distance computation with online argmin; Sort-Inverse Update transforms atomic scatters into localized reductions. Achieves up to 17.9x end-to-end speedup, 33x over cuML, and 200x+ over FAISS on H200 GPUs.

k-means GPU systems

Details
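The two ideas named in the Flash-KMeans summary can be illustrated in plain Python. This is a CPU sketch under stated assumptions, not the paper's GPU kernels: the function names `assign_online_argmin` and `update_by_sorting` are hypothetical, the tile size is arbitrary, and sorting point indices by cluster id only stands in for the Sort-Inverse Update described above.

```python
from typing import List, Sequence

Point = Sequence[float]


def assign_online_argmin(points: List[Point], centroids: List[Point], tile: int = 2) -> List[int]:
    """Assign each point to its nearest centroid, visiting centroids tile by tile.

    Only a running (best distance, best index) pair is kept per point, so the
    full points-by-centroids distance matrix is never stored.
    """
    best_dist = [float("inf")] * len(points)
    best_idx = [-1] * len(points)
    for start in range(0, len(centroids), tile):
        stop = min(start + tile, len(centroids))
        for i, p in enumerate(points):
            for j in range(start, stop):
                d = sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
                if d < best_dist[i]:  # online argmin update
                    best_dist[i], best_idx[i] = d, j
    return best_idx


def update_by_sorting(points: List[Point], labels: List[int],
                      old_centroids: List[Point]) -> List[List[float]]:
    """Recompute centroids after sorting point indices by cluster label.

    Once sorted, each cluster occupies one contiguous run, so every centroid
    is a local reduction over its run rather than a scattered accumulation.
    """
    dim = len(old_centroids[0])
    order = sorted(range(len(points)), key=lambda i: labels[i])
    new_centroids = [list(c) for c in old_centroids]  # empty clusters keep their old value
    pos = 0
    while pos < len(order):
        label = labels[order[pos]]
        sums = [0.0] * dim
        end = pos
        while end < len(order) and labels[order[end]] == label:
            for k in range(dim):
                sums[k] += points[order[end]][k]
            end += 1
        new_centroids[label] = [s / (end - pos) for s in sums]
        pos = end
    return new_centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(1.0, 1.0), (9.0, 9.0), (100.0, 100.0)]
labels = assign_online_argmin(points, centroids)
centroids = update_by_sorting(points, labels, centroids)
```

In this toy run the first two points land in cluster 0, the last two in cluster 1, and the third centroid receives no points and is left unchanged.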

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

ICML 2026 Feb 2026

Haocheng Xi Shuo Yang Yilong Zhao Muyang Li Han Cai Xingyang Li Yujun Lin Zhuoyang Zhang Jintao Zhang Xiuyu Li Zhiying Xu Jun Wu Chenfeng Xu Ion Stoica Song Han Kurt Keutzer

We present Quant VideoGen (QVG), a training-free KV cache quantization framework for autoregressive video diffusion with semantic-aware smoothing and progressive residual quantization. Reduces KV cache memory by up to 7× with under 4% end-to-end latency overhead while improving quality over baselines.

autoregressive video diffusion KV cache compression

Details

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Arxiv 2026 Jan 2026

Haocheng Xi Charlie Ruan Peiyuan Liao Yujun Lin Han Cai Yilong Zhao Shuo Yang Kurt Keutzer Song Han Ligeng Zhu

We unify FP8 precision across RL training and rollout to avoid off-policy numerical mismatch from BF16-train + FP8-rollout, which can destabilize long-horizon RL. Achieves up to 33% rollout speedup, 41% training speedup, and 16% end-to-end speedup over BF16 with stable convergence.

FP8 reinforcement learning LLM reasoning

Details

StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

MLSys 2026 (Best Paper Award) Nov 2025

Tianrui Feng Zhi Li Shuo Yang Haocheng Xi Muyang Li Xiuyu Li Lvmin Zhang Keting Yang Kelly Peng Song Han Maneesh Agrawala Kurt Keutzer Akio Kodaira Chenfeng Xu

We present StreamDiffusionV2, a training-free streaming system that adapts video diffusion models for interactive, low-latency live generation. It integrates an SLO-aware batching scheduler, sink-token-guided rolling KV cache, motion-aware noise controller, and scalable pipeline orchestration across denoising steps and network layers. Achieves 0.5s time-to-first-frame and up to 58.28 FPS (14B) / 64.52 FPS (1.3B) on 4xH100 without TensorRT or quantization.

streaming video generation real-time serving system

Details

Sparse VideoGen 2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

NeurIPS 2025 (Spotlight) Sep 2025

Shuo Yang* Haocheng Xi* Yilong Zhao Muyang Li Jintao Zhang Han Cai Yujun Lin Xiuyu Li Chenfeng Xu Kelly Peng Jianfei Chen Song Han Kurt Keutzer Ion Stoica

We propose a training-free sparse attention framework that uses semantic-aware permutation (clustering and reordering tokens via k-means based on semantic similarity) to achieve a new Pareto frontier in the speed-quality tradeoff.

efficient video generation sparse attention

Details

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

ICML 2025 Feb 2025

Rishabh Tiwari* Haocheng Xi* Aditya Tomar Coleman Hooper Sehoon Kim Maxwell Horton Mahyar Najibi Michael W. Mahoney Kurt Keutzer Amir Gholami

We propose QuantSpec, a self-speculative decoding framework that speeds up long-context inference. QuantSpec maintains high acceptance rates (>90%) and provides consistent end-to-end speedups of up to ~2.5×.

long context generation KV cache compression

Details

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

ICML 2025 Feb 2025

Haocheng Xi* Shuo Yang* Yilong Zhao Chenfeng Xu Muyang Li Xiuyu Li Yujun Lin Han Cai Jintao Zhang Dacheng Li Jianfei Chen Ion Stoica Kurt Keutzer Song Han

We identify spatial-head and temporal-head patterns in the attention maps and exploit them with sparse attention to accelerate video diffusion transformers. Achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively.

efficient video generation sparse attention

Details

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

ICLR 2025 Oct 2024

Haocheng Xi Han Cai Ligeng Zhu Yao Lu Kurt Keutzer Jianfei Chen Song Han

We propose dynamic range expansion for FP8 optimizer states and an FP8 precision flow for FP8 activations. Achieves lossless performance with a 1.54x end-to-end memory reduction and a 1.43x training speedup over BF16.

FP8 training memory efficient training

Details

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

ICML 2024 (Spotlight) Mar 2024

Haocheng Xi Yuxiang Chen Kang Zhao Kai Jun Teh Jianfei Chen Jun Zhu

We propose a method for efficient and accurate transformer pretraining with an INT8 data flow and per-block quantization. We demonstrate its effectiveness on a GPT2-774M model, achieving a 1.42x end-to-end training speedup and a 1.49x memory reduction.

INT8 training per-block quantization

Details
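To make the per-block quantization in the Jetfire entry concrete, here is a minimal plain-Python sketch. It assumes symmetric INT8 quantization with one scale per square tile. The block size, the function names `quantize_per_block` and `dequantize_per_block`, and the sample matrix are illustrative choices, not details taken from the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_block(x: Matrix, block: int) -> Tuple[List[List[int]], List[List[float]]]:
    """Quantize a matrix to INT8 values with one scale per (block x block) tile."""
    rows, cols = len(x), len(x[0])
    q = [[0] * cols for _ in range(rows)]
    scales: List[List[float]] = []
    for r0 in range(0, rows, block):
        r1 = min(r0 + block, rows)
        scale_row = []
        for c0 in range(0, cols, block):
            c1 = min(c0 + block, cols)
            # The largest magnitude inside this tile sets the tile's scale.
            max_abs = max(abs(x[r][c]) for r in range(r0, r1) for c in range(c0, c1))
            scale = max_abs / 127.0 if max_abs > 0 else 1.0
            for r in range(r0, r1):
                for c in range(c0, c1):
                    q[r][c] = max(-127, min(127, round(x[r][c] / scale)))
            scale_row.append(scale)
        scales.append(scale_row)
    return q, scales


def dequantize_per_block(q: List[List[int]], scales: List[List[float]], block: int) -> Matrix:
    """Map INT8 values back to floats using the scale of the tile they belong to."""
    return [[q[r][c] * scales[r // block][c // block] for c in range(len(q[0]))]
            for r in range(len(q))]


# The right-hand tile holds large values; the left-hand tile holds small ones.
x = [[1.0, -2.0, 100.0, 50.0],
     [0.5, 4.0, -25.0, 200.0]]
q, scales = quantize_per_block(x, block=2)
x_hat = dequantize_per_block(q, scales, block=2)
```

Because each tile has its own scale, the entry 0.5 in the left tile still maps to a nonzero integer. With a single per-tensor scale of 200/127 it would round to 0.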

Training Transformers with 4-bit Integers

NeurIPS 2023 Jun 2023

Haocheng Xi Changhao Li Jianfei Chen Jun Zhu

We propose a Hadamard quantizer and leverage score sampling to enable INT4 matrix multiplication in training. Both the forward and backward passes are quantized to INT4 for maximum speedup. Outperforms all existing 4-bit training baselines.

INT4 training Hadamard Quantizer

Details
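As a toy illustration of why a Hadamard transform can help 4-bit quantization, the sketch below rotates a vector containing one outlier before quantizing it. It is not the quantizer from the paper: the helper names, the symmetric [-7, 7] integer range, and the sample vector are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def hadamard(n: int) -> List[List[float]]:
    """Normalized Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def matvec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def quantize_int4(v: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric quantization to integers in [-7, 7] with one shared scale."""
    max_abs = max(abs(a) for a in v)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return [max(-7, min(7, round(a / scale))) for a in v], scale


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


x = [8.0, 1.0, -1.0, 0.5]  # one outlier dominates the range

# Direct quantization: the outlier sets the scale, so small entries lose precision.
q_direct, s_direct = quantize_int4(x)
x_direct = [v * s_direct for v in q_direct]

# Rotate first: the orthogonal transform spreads the outlier across all entries.
h = hadamard(len(x))
q_rot, s_rot = quantize_int4(matvec(h, x))
h_t = [list(col) for col in zip(*h)]  # inverse of an orthogonal matrix is its transpose
x_rot = matvec(h_t, [v * s_rot for v in q_rot])

err_direct = squared_error(x, x_direct)
err_rot = squared_error(x, x_rot)
```

For this particular input the rotated path reconstructs `x` with a smaller squared error than direct quantization. That outcome depends on the input: a vector with no outlier gains little from the rotation.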

Projects

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

First Author Apr 2026

sparsity inference

Details

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Co-First Author Mar 2026

sparsity inference

GitHub

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

First Author Feb 2026

quantization inference

GitHub

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

First Author Jan 2026

quantization training

Details

StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

Core Contributor Nov 2025

We present StreamDiffusionV2, a training-free streaming system that adapts video diffusion models for interactive, low-latency live generation. Combines SLO-aware batching, sink-token rolling KV cache, motion-aware noise control, and pipeline orchestration across denoising steps and network layers. Achieves 0.5s time-to-first-frame and up to 58.28 FPS (14B) / 64.52 FPS (1.3B) on 4xH100 without TensorRT or quantization.

inference

GitHub

Sparse VideoGen 2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Co-First Author Feb 2025 - Jun 2025

sparsity inference

GitHub

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Contributor Sep 2024 - Feb 2025

We propose SpargeAttn, a universal sparse and quantized attention method that accelerates inference for any model. Our method uses a two-stage online filter to select the most important tokens.

sparsity inference

GitHub

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Co-First Author Sep 2024 - Feb 2025

quantization inference

Details

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Co-First Author Sep 2024 - Feb 2025

sparsity inference

GitHub

NVILA: Efficient Frontier Visual Language Models

Contributor Mar 2024 - Nov 2024

We propose NVILA, a new frontier of visual language models that reduces training costs by 4.5x, fine-tuning memory usage by 3.4x, pre-filling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x.

quantization training inference

GitHub

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

First Author Feb 2024 - Sep 2024

quantization training

GitHub

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

First Author Sep 2023 - Jan 2024

We propose an INT8 data flow and per-block quantization to enable INT8 pretraining of transformers. We demonstrate its effectiveness on a GPT2-774M model, achieving a 1.42x end-to-end training speedup and a 1.49x memory reduction.

quantization training

GitHub

Training Transformers with 4-bit Integers

First Author Apr 2022 - Dec 2022

quantization training

GitHub

Education

University of California, Berkeley

2024-Present

Ph.D. in Computer Science, advised by Prof. Kurt Keutzer

GPA: 4 out of 4

Publications:

[ICML 2026] LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
[ICML 2026] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
[Arxiv 2026] Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
[MLSys 2026 Best Paper] StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
[NeurIPS 2025] Sparse VideoGen 2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
[ICML 2025] SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
[ICML 2025] QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
[ICML 2025] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
[CVPR 2025] NVILA: Efficient Frontier Visual Language Models
[ICLR 2025] COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Extracurricular Activities:

Sports: Soccer, Badminton, Pooling
Photography

Tsinghua University

2020-2024

B.E. in Yao Class, Computer Science. Advised by Prof. Jianfei Chen and Prof. Jun Zhu

GPA: 3.83 out of 4

Publications:

[NeurIPS 2023] Training Transformers with 4-bit Integers
[ICML 2024] Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
[ICML 2025] Oscillation-Reduced MXFP4 Training for Vision Transformers

Hi, I am Haocheng

Haocheng Xi

MLSys Researcher at University of California, Berkeley

Skills

Pytorch

Python

CUDA

LaTeX
