
Hi, I am Haocheng

Haocheng Xi

MLSys Researcher at the University of California, Berkeley

I am a first-year PhD student at UC Berkeley. My research focuses on efficient training and inference of large language models and diffusion models. I graduated from the Yao Class at Tsinghua University, led by Prof. Andrew Yao.

My advisor is Prof. Kurt Keutzer, and I also work closely with Prof. Song Han at MIT. Back at Tsinghua University, I was fortunate to be advised by Prof. Jianfei Chen and Prof. Jun Zhu. I was also fortunate to be advised by Prof. Sheng Wang at the University of Washington.

Large Language Models
Diffusion Models
Efficiency
Quantization
Sparsity
Reasoning

Experiences

NVIDIA

Feb 2024 - Aug 2025

Beijing, China

Conducted research on FP8 training for large language models, advised by Prof. Song Han.

Research Intern


Responsibilities:
  • Proposed COAT, a memory-efficient FP8 training method that compresses optimizer states and activations.
  • Published a first-author paper, accepted at ICLR 2025; the code is open-sourced.
  • Participated in the NVILA project, responsible for FP8 training of vision-language models.

Education

Projects

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Co-First Author · Sep 2024 - Feb 2025

We identify spatial-head and temporal-head patterns in the attention maps of video diffusion transformers and exploit them with sparse attention to accelerate inference, achieving up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo.
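
A minimal sketch of the idea (not the actual Sparse VideoGen kernels): spatial heads roughly attend within a frame, so their mask is block-diagonal over frames, while temporal heads attend to the same spatial position across frames, giving a strided mask. The dense masking below is only for illustration; the real speedup comes from sparse attention kernels that skip masked blocks. All names and shapes here are illustrative assumptions.

```python
import torch

def spatial_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    # Spatial-head pattern (sketch): each token attends only to tokens in its own frame.
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    # Temporal-head pattern (sketch): each token attends to the same spatial
    # position across all frames.
    n = num_frames * tokens_per_frame
    pos_id = torch.arange(n) % tokens_per_frame
    return pos_id[:, None] == pos_id[None, :]

def masked_attention(q, k, v, mask):
    # Dense reference implementation; a real system would use a sparse kernel
    # that skips the masked-out blocks entirely.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```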

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Co-First Author · Sep 2024 - Feb 2025

We propose QuantSpec, a self-speculative decoding framework that speeds up long-context inference using a hierarchical quantized KV cache. QuantSpec maintains high acceptance rates (>90%) and reliably delivers consistent end-to-end speedups of up to ~2.5x.
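
A minimal greedy sketch of self-speculative decoding under these assumptions: `logits_quant` and `logits_full` are hypothetical helpers that run the same model with a quantized KV cache (cheap drafter) and with the full-precision cache (verifier), each returning per-position logits. This is illustrative pseudocode, not QuantSpec's actual implementation, which uses a hierarchical quantized cache and proper speculative verification rather than greedy matching.

```python
import torch
from typing import Callable

def self_speculative_decode(
    logits_quant: Callable[[torch.Tensor], torch.Tensor],  # drafter: quantized KV cache (assumed helper)
    logits_full: Callable[[torch.Tensor], torch.Tensor],   # verifier: full-precision KV cache (assumed helper)
    prompt: torch.Tensor,                                   # 1-D LongTensor of token ids
    num_draft: int = 4,
    max_new: int = 64,
) -> torch.Tensor:
    seq = prompt.clone()
    while seq.numel() - prompt.numel() < max_new:
        # 1) Draft a few tokens greedily using the cheap quantized-cache pass.
        draft = seq.clone()
        for _ in range(num_draft):
            next_tok = logits_quant(draft)[-1].argmax().reshape(1)
            draft = torch.cat([draft, next_tok])
        # 2) Verify every drafted position with one full-precision pass.
        full = logits_full(draft)            # assumed shape: (len(draft), vocab)
        accepted = seq.numel()
        for i in range(seq.numel(), draft.numel()):
            target = full[i - 1].argmax()
            if draft[i] == target:
                accepted = i + 1             # draft token matches the verifier: keep it
            else:
                draft = torch.cat([draft[:i], target.reshape(1)])
                accepted = i + 1             # first mismatch: take the verifier's token and stop
                break
        seq = draft[:accepted]
    return seq
```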

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
First Author · Feb 2024 - Sep 2024

We propose dynamic range expansion for the FP8 optimizer and an FP8 precision flow for activations, achieving lossless performance with an end-to-end 1.54x memory reduction and 1.43x training speedup over BF16.
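
As a hedged illustration of what dynamic range expansion means (not COAT's exact formulation), the sketch below normalizes a group of optimizer-state values, raises their magnitudes to a heuristic power k so the group's dynamic range better matches what E4M3 can represent, quantizes to FP8, and inverts the expansion on dequantization. The constants, the choice of k, and the group granularity are assumptions, and it assumes a recent PyTorch that provides torch.float8_e4m3fn.

```python
import torch

FP8_MAX = 448.0             # E4M3 maximum magnitude
FP8_MIN_NORMAL = 2.0 ** -6  # rough E4M3 minimum normal (assumption)

def expand_and_quantize(x: torch.Tensor, eps: float = 1e-12):
    absx = x.abs().clamp_min(eps)
    amax = absx.max()
    # Normalize to (0, 1], then expand: |x|^k stays in (0, 1] but stretches or
    # compresses the group's dynamic range toward FP8's (heuristic k -- an assumption).
    normed = absx / amax
    group_range = 1.0 / normed.min()
    fp8_range = torch.tensor(FP8_MAX / FP8_MIN_NORMAL)
    k = (torch.log(fp8_range) / torch.log(group_range).clamp_min(eps)).clamp(0.1, 10.0)
    expanded = normed.pow(k)
    q = (x.sign() * expanded * FP8_MAX).to(torch.float8_e4m3fn)
    return q, amax, k

def dequantize(q: torch.Tensor, amax: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Undo the scaling and the power expansion.
    y = q.to(torch.float32) / FP8_MAX
    return y.sign() * y.abs().pow(1.0 / k) * amax
```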

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
First Author · Sep 2023 - Jan 2024

We propose an INT8 precision flow and per-block quantization to enable INT8 pretraining of transformers, demonstrate effectiveness on a 774M-parameter GPT-2 model, and achieve an end-to-end 1.42x training speedup and 1.49x memory reduction.
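
A minimal sketch of per-block (tile-wise) INT8 quantization, the core idea that each small tile of a matrix gets its own scale so a single outlier cannot distort distant entries. This is illustrative PyTorch, not Jetfire's fused INT8 kernels; the block size and helper names are assumptions.

```python
import torch

def quantize_per_block(x: torch.Tensor, block: int = 32):
    # Split the matrix into (block x block) tiles and give each tile its own scale.
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0
    tiles = x.reshape(rows // block, block, cols // block, block)
    scale = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(tiles / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_per_block(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Rescale each tile and stitch the tiles back into a (rows, cols) matrix.
    tiles = q.float() * scale
    blocks_r, block, blocks_c, _ = tiles.shape
    return tiles.reshape(blocks_r * block, blocks_c * block)
```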

Training Transformers with 4-bit Integers
First Author · Apr 2022 - Dec 2022

We propose a Hadamard quantizer and leverage score sampling to enable INT4-precision matrix multiplication during training. Both the forward and backward passes are quantized to INT4 for maximum speedup, outperforming all existing 4-bit training baselines.
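
A rough sketch of the Hadamard-quantizer idea (illustrative, not the paper's exact quantizer): rotate each row by an orthonormal Hadamard matrix so outliers are spread across dimensions, quantize onto the symmetric INT4 grid, and undo the rotation on dequantization. It assumes the last dimension is a power of two and omits leverage score sampling.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    # Build an orthonormal n x n Hadamard matrix (n must be a power of two).
    assert n & (n - 1) == 0
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

def hadamard_int4_quantize(x: torch.Tensor):
    # Rotate so outliers are spread out, then quantize onto the symmetric INT4 grid [-7, 7].
    H = hadamard(x.shape[-1])
    rotated = x @ H
    scale = rotated.abs().amax().clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(rotated / scale), -7, 7)
    return q, scale, H

def hadamard_int4_dequantize(q, scale, H):
    # H is orthogonal, so the inverse rotation is H^T.
    return (q * scale) @ H.t()
```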

Publications

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
arXiv · Feb 2025

We identify spatial-head and temporal-head patterns in the attention maps of video diffusion transformers and exploit them with sparse attention to accelerate inference, achieving up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo.

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
arXiv · Feb 2025

We propose QuantSpec, a self-speculative decoding framework that speeds up long-context inference using a hierarchical quantized KV cache. QuantSpec maintains high acceptance rates (>90%) and reliably delivers consistent end-to-end speedups of up to ~2.5x.

