
Hi, I am Haocheng

Haocheng Xi

MLSys Researcher at the University of California, Berkeley

I am a first-year PhD student at UC Berkeley. My research focuses on efficient training and inference of large language models and diffusion models. I graduated from the Yao Class at Tsinghua University, led by Prof. Andrew Yao.

My advisor is Prof. Kurt Keutzer, and I also work closely with Prof. Song Han at MIT. Back at Tsinghua University, I was fortunate to be advised by Prof. Jianfei Chen and Prof. Jun Zhu. I was also fortunate to be advised by Prof. Sheng Wang at the University of Washington.

Large Language Models
Diffusion Models
Efficiency
Quantization
Sparsity
Reasoning

Experiences

NVIDIA

Research Intern

Feb 2024 - Aug 2025

Beijing, China

Conducted research on FP8 training for large language models, advised by Prof. Song Han.

Responsibilities:
  • Proposed COAT, a memory-efficient FP8 training method that compresses optimizer states and activations.
  • Published a first-author paper accepted at ICLR 2025; the code is open-sourced.
  • Participated in the NVILA project, responsible for FP8 training of vision-language models.

Education

Projects

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Co-First Author Sep 2024 - Feb 2025

We identify spatial-head and temporal-head patterns in the attention maps of video diffusion transformers and propose sparse attention to exploit them, achieving up to 2.28x and 2.33x end-to-end speedups on CogVideoX-v1.5 and HunyuanVideo.
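
As a rough illustration of the two patterns, the numpy sketch below builds the block-diagonal (spatial) and strided (temporal) attention masks for a toy frame-major token layout. The function names and layout are assumptions for illustration only, not the released Sparse VideoGen kernels.

```python
# Illustrative sketch: boolean masks for the two attention patterns, assuming
# video tokens are laid out frame-major as num_frames * tokens_per_frame.
import numpy as np

def spatial_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Block-diagonal mask: each token attends only within its own frame."""
    frame_id = np.arange(num_frames * tokens_per_frame) // tokens_per_frame
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Strided mask: each token attends to the same spatial location across frames."""
    pos_id = np.arange(num_frames * tokens_per_frame) % tokens_per_frame
    return pos_id[:, None] == pos_id[None, :]

if __name__ == "__main__":
    # Each mask keeps only a small fraction of the full attention map,
    # which is where the sparsity-based speedup comes from.
    print("spatial density :", spatial_mask(4, 3).mean())   # 1 / num_frames
    print("temporal density:", temporal_mask(4, 3).mean())  # 1 / tokens_per_frame
```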

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Co-First Author Sep 2024 - Feb 2025

We propose QuantSpec, a self-speculative decoding framework that speeds up long-context inference. QuantSpec maintains high acceptance rates (>90%) and reliably delivers end-to-end speedups of up to ~2.5x.
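
A toy sketch of the self-speculative loop is below: `draft_step` stands in for the model running on the quantized KV cache and `verify_step` for the same model with the full-precision cache. Both are hypothetical placeholders, and greedy verification is shown for simplicity; this is not the QuantSpec implementation.

```python
# Toy self-speculative decoding loop with greedy verification.
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_step: Callable[[List[int]], int],
                       verify_step: Callable[[List[int]], int],
                       num_new_tokens: int,
                       gamma: int = 4) -> List[int]:
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    while len(tokens) < target_len:
        # 1) Cheaply draft gamma tokens with the quantized-cache model.
        draft, ctx = [], list(tokens)
        for _ in range(gamma):
            nxt = draft_step(ctx)
            draft.append(nxt)
            ctx.append(nxt)
        # 2) Verify with the full-precision cache: keep the longest matching
        #    prefix, then append one corrected token on the first mismatch.
        n_accept, correction = 0, None
        for i, nxt in enumerate(draft):
            target = verify_step(tokens + draft[:i])
            if target == nxt:
                n_accept += 1
            else:
                correction = target
                break
        tokens.extend(draft[:n_accept])
        if correction is not None:
            tokens.append(correction)
    return tokens[:target_len]

if __name__ == "__main__":
    def verify_step(ctx):   # toy "exact" model: next token = last + 1
        return ctx[-1] + 1

    def draft_step(ctx):    # toy draft model that is occasionally wrong
        return ctx[-1] + 1 if len(ctx) % 5 else ctx[-1] + 2

    print(speculative_decode([0], draft_step, verify_step, num_new_tokens=10))
```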

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
First Author Feb 2024 - Sep 2024

We propose dynamic range expansion for the FP8 optimizer and an FP8 precision flow for activations, achieving lossless performance with a 1.54x end-to-end memory reduction and a 1.43x training speedup over BF16.
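
The numpy sketch below illustrates the dynamic-range-expansion idea in isolation: a group of optimizer states is raised to a power k that stretches its narrow range onto E4M3's, quantized with a simulated E4M3 rounder, and inverted on dequantization. The helper names and the fake-quant routine are assumptions for this sketch, not the COAT CUDA kernels.

```python
# Illustrative simulation of dynamic range expansion for FP8 (E4M3) optimizer
# states. Assumes a group of strictly positive values whose dynamic range is
# much narrower than E4M3's.
import numpy as np

E4M3_MAX = 448.0            # largest normal E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6

def fake_quant_e4m3(x: np.ndarray) -> np.ndarray:
    """Simulate E4M3 rounding: clamp to the max normal and keep 3 mantissa bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mant, exp = np.frexp(x)
    return np.ldexp(np.round(mant * 16.0) / 16.0, exp)

def quant_expand(v: np.ndarray):
    """Quantize a group of positive states after expanding its dynamic range."""
    vmax, vmin = v.max(), v.min()
    # Pick k so that (vmax / vmin) ** k equals E4M3's dynamic range.
    k = np.log(E4M3_MAX / E4M3_MIN_NORMAL) / np.log(vmax / vmin)
    scale = E4M3_MAX / vmax ** k
    return fake_quant_e4m3(v ** k * scale), k, scale

def dequant_expand(q: np.ndarray, k: float, scale: float) -> np.ndarray:
    return (q / scale) ** (1.0 / k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = np.abs(rng.standard_normal(4096)) * 1e-4 + 1e-6    # narrow-range positive states
    expanded = dequant_expand(*quant_expand(v))
    naive_scale = E4M3_MAX / v.max()                        # plain per-group scaling baseline
    naive = fake_quant_e4m3(v * naive_scale) / naive_scale
    print("mean relative error with expansion:", np.abs(expanded / v - 1).mean())
    print("mean relative error, plain scaling:", np.abs(naive / v - 1).mean())
```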

NVILA: Efficient Frontier Visual Language Models
Contributor Mar 2024 - Nov 2024

We propose NVILA, a new frontier of efficient visual language models that reduces training costs by 4.5x, fine-tuning memory usage by 3.4x, pre-filling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x.

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Contributor Sep 2024 - Feb 2025

We propose SpargeAttn, a universal sparse and quantized attention mechanism for accelerating inference of any model. It uses a two-stage online filter to select the most important tokens.

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
First Author Sep 2023 - Jan 2024

We propose an INT8 precision flow and per-block quantization to enable INT8 pretraining of transformers, demonstrate effectiveness on a GPT2-774M model, and achieve a 1.42x end-to-end training speedup and a 1.49x memory reduction.
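
A minimal numpy sketch of per-block quantization is below: each 32x32 tile of a matrix gets its own INT8 scale, so a single outlier only degrades its own tile. This is an illustrative simulation with made-up helper names, not the Jetfire GPU kernels.

```python
# Illustrative per-block INT8 quantization: every block x block tile gets its own scale.
import numpy as np

def quantize_per_block(x: np.ndarray, block: int = 32):
    """Return int8 codes and per-tile scales for a 2-D matrix."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0
    tiles = x.reshape(rows // block, block, cols // block, block)
    scales = np.abs(tiles).max(axis=(1, 3), keepdims=True) / 127.0
    codes = np.clip(np.round(tiles / scales), -127, 127).astype(np.int8)
    return codes, scales

def dequantize_per_block(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    tiles = codes.astype(np.float32) * scales
    n_rb, block, n_cb, _ = tiles.shape
    return tiles.reshape(n_rb * block, n_cb * block)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((128, 128)).astype(np.float32)
    x[0, 0] = 100.0                                    # a single outlier value
    codes, scales = quantize_per_block(x)
    err_block = np.abs(dequantize_per_block(codes, scales) - x).mean()
    # Per-tensor baseline: one scale for the whole matrix, so the outlier
    # inflates the quantization step everywhere.
    s = np.abs(x).max() / 127.0
    err_tensor = np.abs(np.clip(np.round(x / s), -127, 127) * s - x).mean()
    print(f"mean abs error: per-block {err_block:.5f} vs per-tensor {err_tensor:.5f}")
```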

Training Transformers with 4-bit Integers
First Author Apr 2022 - Dec 2022

We propose a Hadamard quantizer and leverage score sampling to enable INT4 matrix multiplication in training. Both the forward and backward passes are quantized to INT4 precision for maximum speedup, outperforming all existing 4-bit training baselines.
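
The sketch below illustrates only the Hadamard-quantizer half of the idea: rotating activations by an orthonormal Hadamard matrix spreads outlier channels across all dimensions before 4-bit rounding, and the rotation is undone afterwards. Leverage score sampling is omitted, and the quantizer here is a simple float simulation rather than the paper's INT4 kernels.

```python
# Illustrative 4-bit quantization in the Hadamard domain (codes simulated with floats).
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def quant_int4(x: np.ndarray):
    """Symmetric per-row 4-bit quantization onto the integer grid [-7, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -7, 7), scale

def dequant_int4(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 128)).astype(np.float32)
    x[:, 5] *= 50.0                                    # one outlier channel
    h = hadamard(x.shape[-1])
    direct = dequant_int4(*quant_int4(x))              # quantize activations directly
    rotated = dequant_int4(*quant_int4(x @ h)) @ h.T   # quantize in the Hadamard domain
    print("mean abs error, direct INT4  :", np.abs(direct - x).mean())
    print("mean abs error, Hadamard INT4:", np.abs(rotated - x).mean())
```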

Selected Publications

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
arXiv Feb 2025

We identify spatial-head and temporal-head patterns in the attention maps of video diffusion transformers and propose sparse attention to exploit them, achieving up to 2.28x and 2.33x end-to-end speedups on CogVideoX-v1.5 and HunyuanVideo.

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

We propose QuantSpec, a self-speculative decoding framework that speeds up long-context inference. QuantSpec maintains high acceptance rates (>90%) and reliably delivers end-to-end speedups of up to ~2.5x.

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
ICLR 2025 Oct 2024

We propose dynamic range expansion for the FP8 optimizer and an FP8 precision flow for activations, achieving lossless performance with a 1.54x end-to-end memory reduction and a 1.43x training speedup over BF16.

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

We propose an efficient and accurate method for transformer pretraining with an INT8 data flow and per-block quantization, demonstrate effectiveness on a GPT2-774M model, and achieve a 1.42x end-to-end training speedup and a 1.49x memory reduction.

Training Transformers with 4-bit Integers
NeurIPS 2023 Jun 2023

We propose a Hadamard quantizer and leverage score sampling to enable INT4 matrix multiplication in training. Both the forward and backward passes are quantized to INT4 precision for maximum speedup, outperforming all existing 4-bit training baselines.

Skills

Featured Posts