NICS-EffAlg

NeurIPS 2024 Edge-Device LLM Competition Team NICS-EffAlg Solutions (2nd Place)

2024-12-18T00:00:00+00:00

Introduction

In this blog post, we introduce the solutions presented by Team NICS-EffAlg, which participated in the Edge-LLM Competition. Composed of members from Tsinghua University and Infinigence AI.The team showcased their innovative approach to compressing the Llama-3.1-8b, QWen2-7b, and Phi-2 models. Additionally, they developed a new LLM model, C4NGPT-0.8b, specifically for Track 2 of the competition.

The team won the 2nd place in both the Model Compression Track and the Training From Scratch Track at the NeurIPS 2024 Edge-Device LLM Competition.

Team Members: Zhihang Yuan, Hanling Zhang, Shiyao Li, Xuefei Ning, and Yu Wang.

Track 1: Model Compression

Motivation

The primary goal of Track 1 was to compress the Llama 3.1 8B model by 50%. The team recognized that one-shot pruning was not feasible and opted for an iterative approach. They explored various pruning techniques, including vocabulary pruning, width pruning, and decomposition, to achieve the desired compression ratio.

Pipeline and Techniques

Vocabulary Pruning:

The team removed 40% of the least frequent vocabulary tokens. Fine-tuning was performed post-pruning to recover performance.

Width Pruning:

Utilized LLM-Pruner for width pruning. Reduced the number of heads from 32 to 28 (12.5% reduction) and decreased the intermediate size by 15%.

Decomposition:

Introduced DecompLLM, built upon ASVD and SVD-LLM with three key improvements. Employed iterative top-K compression, where layers that minimally increased perplexity were pruned and frozen during fine-tuning.

Fine-tuning Configuration

Recovery Fine-tuning:

Used LoRA with parameters: r=8, alpha=16, dropout=0.1. Dataset: Alpaca (English and Chinese) processed into chat format. Learning rate: 2e-5, batch size: 4, iterations: 3000.

DecompLLM Fine-tuning:

Partially trained with a small fraction of the data (10%). Fine-tuned with the same dataset and configuration as above.

Empirical Findings

Calibration Dataset Importance: The team found that the length and diversity of the calibration dataset significantly impacted performance. Results showed that longer sequences (e.g., 256 tokens) generally performed better.

Pruning Ratio Impact: Different pruning ratios and datasets were tested, with the best results achieved using a pruning ratio of 0.2 on the alpaca bilingual chat dataset.

Track 2: C4NGPT-0.8b Model

Model Architecture

nGPT Architecture: Designed a model with 0.8B parameters, based on the normalized Transformer architecture. Key features: Faster convergence and improved performance on downstream tasks due to representation learning on a hypersphere.

Pretraining Phase

Data Processing:

C4 data was filtered to remove sensitive content, URLs, and repeats. Included 15% Chinese corpus and 85% English corpus, with a sequence length of 1024.

Pretraining:

Trained for 2000 GPU hours (32x A100 40G, 3 days). Batch size: 512, Adam optimizer.

SFT (Supervised Fine-Tuning)

Data Processing:

Alpaca data was transformed into a Llama3-like prompt format, including both English and Chinese data.

Training:

Batch size: 128, Adam optimizer. Trained for 1 epoch with 1 GPU hour.

Conclusion

Team NICS-EffAlg’s approach to model compression and their introduction of the C4NGPT model showcase innovative techniques in reducing model size while maintaining performance. Their iterative pruning strategy and the nGPT architecture demonstrate promising directions for efficient LLM deployment on edge devices.

Acknowledgments

Special thanks to the NICS-efc Lab and Infinigence AI for their contributions to this research.

Generative Model Compression and Acceleration

2024-07-31T00:00:00+00:00

Slides

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

2024-06-25T00:00:00+00:00

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

2024-06-25T00:00:00+00:00

DiTFastAttn: Attention Compression for Diffusion Transformer Models

2024-06-07T00:00:00+00:00

Code Paper

Diffusion Transformers (DiT) have emerged as a powerful tool for image and video generation tasks. However, their quadratic computational complexity due to the self-attention mechanism poses a significant challenge, particularly for high-resolution and long video tasks. This paper mitigate the computational bottleneck in DiT models by introducing a novel post-training model compression method. We identify three key redundancies in the attention computation during DiT inference and we propose three techniques.

Window Attention with Residual Caching - Reduces spatial redundancy.
Temporal Similarity Reduction - Exploit the similarity between steps.
Conditional Redundancy Elimination - Skips redundant computations during conditional generation.

Generation Speed Comparasion

Image Generation Results

1K Image Generation Results

2K Image Generation Results

Video Generation Results

Raw

FLOPs Reduction

You can find the code for DiTFastAttn on GitHub at DiTFastAttention. Feel free to check out the repository for more details and to access the code.

An Introduction to Quantization of Large Language Models

2023-08-30T00:00:00+00:00

Video Slides

The talk is in Chinese, and the English-version video will soon be available.

Model Compression Towards Efficient Deep Learning Inference

2023-08-29T00:00:00+00:00

Slides

Neural Architecture Search and Architecture Encoding

2022-12-12T00:00:00+00:00

Slides

This talk includes three parts:

First, I introduce the basics of neural architecture search (NAS), including background, motivation, problem definition.
Then, I give an overview of NAS researches, organized by research questions, i.e., “what to search”, “how to search”, “what does the search tell us”. I summarize existing solutions to solve these questions.
Finally, I introduce my researches that utilize architecture encoding to answer “how to search” along two directions, i.e., accelerate exploration and improve evaluation.

The talk is in Chinese, and the English-version video will soon be available.