<br />
<b>Deprecated</b>:  The each() function is deprecated. This message will be suppressed on further calls in <b>/home/zhenxiangba/zhenxiangba.com/public_html/phproxy-improved-master/index.php</b> on line <b>456</b><br />
<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://nics-effalg.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nics-effalg.com/" rel="alternate" type="text/html" /><updated>2025-04-14T06:20:13+00:00</updated><id>https://nics-effalg.com/feed.xml</id><title type="html">NICS-EffAlg</title><subtitle>We are pioneers in energy-efficient AI hardware and software solutions that unlock exponential performance gains.</subtitle><entry><title type="html">NeurIPS 2024 Edge-Device LLM Competition Team NICS-EffAlg Solutions (2nd Place)</title><link href="https://nics-effalg.com/NeurIPS2024_EdgeLLM_Competition" rel="alternate" type="text/html" title="NeurIPS 2024 Edge-Device LLM Competition Team NICS-EffAlg Solutions (2nd Place)" /><published>2024-12-18T00:00:00+00:00</published><updated>2024-12-18T00:00:00+00:00</updated><id>https://nics-effalg.com/NeurIPS2024_EdgeLLM_Competition</id><content type="html" xml:base="https://nics-effalg.com/NeurIPS2024_EdgeLLM_Competition"><![CDATA[<!-- <span>
    <a class="custom_buttom" href="../assets/ppt/2024-10-18-GMCA.pdf">
    Slides
    </a>
</span> -->

<h2 id="introduction">Introduction</h2>

<p>In this blog post, we introduce the solutions presented by Team NICS-EffAlg, which participated in the Edge-LLM Competition. Composed of members from Tsinghua University and Infinigence AI.The team showcased their innovative approach to compressing the Llama-3.1-8b, QWen2-7b, and Phi-2 models. Additionally, they developed a new LLM model, C4NGPT-0.8b, specifically for Track 2 of the competition.</p>

<p>The team won the 2nd place in both the Model Compression Track and the Training From Scratch Track at the NeurIPS 2024 Edge-Device LLM Competition.</p>

<div style="display: flex; justify-content: space-between; width: 100%;">
    <figure style="width: 33%; height: auto;">
        <img src="../assets/posts_images/neurips2024_edgellm_competition/edgellm_competition_award1.jpg" alt="pixart1k_result" />
    </figure>

    <figure style="width: 33%; height: auto;">
        <img src="../assets/posts_images/neurips2024_edgellm_competition/edgellm_competition_award2.jpg" alt="DiTFastAttn_overview" />
    </figure>

    <figure style="width: 30%; height: auto;">
        <img src="../assets/posts_images/neurips2024_edgellm_competition/edgellm_competition_talk.jpg" alt="DiTFastAttn_overview" />
    </figure>
</div>

<p>Team Members: Zhihang Yuan, Hanling Zhang, Shiyao Li, Xuefei Ning, and Yu Wang.</p>

<h2 id="track-1-model-compression">Track 1: Model Compression</h2>
<h3 id="motivation">Motivation</h3>
<p>The primary goal of Track 1 was to compress the Llama 3.1 8B model by 50%. The team recognized that one-shot pruning was not feasible and opted for an iterative approach. They explored various pruning techniques, including vocabulary pruning, width pruning, and decomposition, to achieve the desired compression ratio.</p>

<h3 id="pipeline-and-techniques">Pipeline and Techniques</h3>
<h4 id="vocabulary-pruning">Vocabulary Pruning:</h4>

<p>The team removed 40% of the least frequent vocabulary tokens.
Fine-tuning was performed post-pruning to recover performance.</p>

<h4 id="width-pruning">Width Pruning:</h4>

<p>Utilized LLM-Pruner for width pruning.
Reduced the number of heads from 32 to 28 (12.5% reduction) and decreased the intermediate size by 15%.</p>

<h4 id="decomposition">Decomposition:</h4>

<p>Introduced DecompLLM, built upon ASVD and SVD-LLM with three key improvements.
Employed iterative top-K compression, where layers that minimally increased perplexity were pruned and frozen during fine-tuning.</p>

<h4 id="fine-tuning-configuration">Fine-tuning Configuration</h4>
<h4 id="recovery-fine-tuning">Recovery Fine-tuning:</h4>

<p>Used LoRA with parameters: r=8, alpha=16, dropout=0.1.
Dataset: Alpaca (English and Chinese) processed into chat format.
Learning rate: 2e-5, batch size: 4, iterations: 3000.</p>

<h4 id="decompllm-fine-tuning">DecompLLM Fine-tuning:</h4>

<p>Partially trained with a small fraction of the data (10%).
Fine-tuned with the same dataset and configuration as above.</p>

<h4 id="empirical-findings">Empirical Findings</h4>

<p>Calibration Dataset Importance:
The team found that the length and diversity of the calibration dataset significantly impacted performance.
Results showed that longer sequences (e.g., 256 tokens) generally performed better.</p>

<p>Pruning Ratio Impact:
Different pruning ratios and datasets were tested, with the best results achieved using a pruning ratio of 0.2 on the alpaca bilingual chat dataset.</p>

<h2 id="track-2-c4ngpt-08b-model">Track 2: C4NGPT-0.8b Model</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>nGPT Architecture:
Designed a model with 0.8B parameters, based on the normalized Transformer architecture.
Key features: Faster convergence and improved performance on downstream tasks due to representation learning on a hypersphere.</p>

<h3 id="pretraining-phase">Pretraining Phase</h3>
<h4 id="data-processing">Data Processing:</h4>

<p>C4 data was filtered to remove sensitive content, URLs, and repeats.
Included 15% Chinese corpus and 85% English corpus, with a sequence length of 1024.</p>

<h3 id="pretraining">Pretraining:</h3>
<p>Trained for 2000 GPU hours (32x A100 40G, 3 days).
Batch size: 512, Adam optimizer.</p>

<h3 id="sft-supervised-fine-tuning">SFT (Supervised Fine-Tuning)</h3>
<h4 id="data-processing-1">Data Processing:</h4>

<p>Alpaca data was transformed into a Llama3-like prompt format, including both English and Chinese data.</p>

<h4 id="training">Training:</h4>

<p>Batch size: 128, Adam optimizer.
Trained for 1 epoch with 1 GPU hour.</p>

<h2 id="conclusion">Conclusion</h2>
<p>Team NICS-EffAlg’s approach to model compression and their introduction of the C4NGPT model showcase innovative techniques in reducing model size while maintaining performance. Their iterative pruning strategy and the nGPT architecture demonstrate promising directions for efficient LLM deployment on edge devices.</p>

<h2 id="acknowledgments">Acknowledgments</h2>
<p>Special thanks to the NICS-efc Lab and Infinigence AI for their contributions to this research.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Generative Model Compression and Acceleration</title><link href="https://nics-effalg.com/GMCA" rel="alternate" type="text/html" title="Generative Model Compression and Acceleration" /><published>2024-07-31T00:00:00+00:00</published><updated>2024-07-31T00:00:00+00:00</updated><id>https://nics-effalg.com/GMCA</id><content type="html" xml:base="https://nics-effalg.com/GMCA"><![CDATA[<p><span>
    <a class="custom_buttom" href="../assets/ppt/2024-10-18-GMCA.pdf">
    Slides
    </a>
</span></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Slides]]></summary></entry><entry><title type="html">ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation</title><link href="https://nics-effalg.com/ViDiT_Q" rel="alternate" type="text/html" title="ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation" /><published>2024-06-25T00:00:00+00:00</published><updated>2024-06-25T00:00:00+00:00</updated><id>https://nics-effalg.com/ViDiTQ</id><content type="html" xml:base="https://nics-effalg.com/ViDiT_Q"><![CDATA[<iframe src="https://a-suozhang.xyz/viditq.github.io/" style="
    position: fixed;
    right: 0px;
    width: 100%;
    border: none;
    margin: 0;
    padding: 0;
    overflow: hidden;
    z-index: 999999;
    height: 100%;
  ">
</iframe>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression</title><link href="https://nics-effalg.com/MoA" rel="alternate" type="text/html" title="MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" /><published>2024-06-25T00:00:00+00:00</published><updated>2024-06-25T00:00:00+00:00</updated><id>https://nics-effalg.com/MoA</id><content type="html" xml:base="https://nics-effalg.com/MoA"><![CDATA[<iframe src="https://thu-nics.github.io/MoA_project_page/" style="
    position: fixed;
    right: 0px;
    width: 100%;
    border: none;
    margin: 0;
    padding: 0;
    overflow: hidden;
    z-index: 999999;
    height: 100%;
  ">
</iframe>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">DiTFastAttn: Attention Compression for Diffusion Transformer Models</title><link href="https://nics-effalg.com/DiTFastAttn" rel="alternate" type="text/html" title="DiTFastAttn: Attention Compression for Diffusion Transformer Models" /><published>2024-06-07T00:00:00+00:00</published><updated>2024-06-07T00:00:00+00:00</updated><id>https://nics-effalg.com/DitFastAttn</id><content type="html" xml:base="https://nics-effalg.com/DiTFastAttn"><![CDATA[<p><span>
    <a class="custom_buttom" href="https://github.com/thu-nics/DiTFastAttn">
    Code
    </a>
    <a class="custom_buttom" href="https://arxiv.org/pdf/2406.08552">
    Paper
    </a>
</span></p>

<div style="flex: 1; flex-direction: column; padding: 20px;" class="project_card">
    Diffusion Transformers (DiT) have emerged as a powerful tool for image and video generation tasks. However, their quadratic computational complexity due to the self-attention mechanism poses a significant challenge, particularly for high-resolution and long video tasks. This paper mitigate the computational bottleneck in DiT models by introducing a novel post-training model compression method. We identify three key redundancies in the attention computation during DiT inference and we propose three techniques.
    <ul>
      <li><strong>Window Attention with Residual Caching</strong> - Reduces spatial redundancy.</li>
      <li><strong>Temporal Similarity Reduction</strong> - Exploit the similarity between steps.</li>
      <li><strong>Conditional Redundancy Elimination</strong> - Skips redundant computations during conditional generation.</li>
    </ul>
</div>

<h3 id="generation-speed-comparasion">Generation Speed Comparasion</h3>

<center>
	<figure style="width: 75%;">
    <img src="../assets/posts_images/DiTFastAttn_Demo_v3_complete.gif" alt="DiTFastAttn demo" />
  </figure>
</center>

<!-- <div style="flex: 1; padding: 10px; background-color: #f0f0f0; text-align: justify;">
  <p>We evaluate DiTFastAttn on three commonly used diffusion transformers: DiT and PixArt-Sigma for image generation tasks, and Open-Sora for video generation tasks. Our findings demonstrate that DiTFastAttn consistently reduces computational cost. Notably, the higher the image resolution, the greater the savings in computation and latency. For instance, with PixArt-Sigma, DiTFastAttn delivers a 36% to 88% reduction in attention computation and a latency decrease of up to 37% during the generation of 2K images.</p>
  <p>We experiment with different thresholds δ at intervals of 0.025, starting from 0.95. We denote these threshold settings as D1 (δ=0.975), D2 (δ=0.95), ..., D6 (δ=0.85), respectively.</p>
</div> -->

<h3 id="image-generation-results">Image Generation Results</h3>

<figure style="width: 95%; height: auto;">
    <img src="../assets/posts_images/pixart1k_result.png" alt="pixart1k_result" />
    <figcaption style="text-align: center;">1K Image Generation Results</figcaption>
</figure>

<figure style="width: 95%; height: auto;">
    <img src="../assets/posts_images/pixart2k_result.png" alt="DiTFastAttn_overview" />
    <figcaption style="text-align: center;">2K Image Generation Results</figcaption>
</figure>

<h3 id="video-generation-results">Video Generation Results</h3>

<div>   
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/raw.gif" alt="raw" />     
    <figcaption style="text-align: center;">Raw</figcaption>   
  </figure>    
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D1.gif" alt="D1" />     
    <figcaption style="text-align: center;">D1</figcaption>   
  </figure>    
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D2.gif" alt="D2" />     
    <figcaption style="text-align: center;">D2</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D3.gif" alt="D3" />     
    <figcaption style="text-align: center;">D3</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D4.gif" alt="D4" />     
    <figcaption style="text-align: center;">D4</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D5.gif" alt="D5" />     
    <figcaption style="text-align: center;">D5</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D6.gif" alt="D6" />     
    <figcaption style="text-align: center;">D6</figcaption>   
  </figure>
</div>

<div>   
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/raw2.gif" alt="raw" />     
    <figcaption style="text-align: center;">Raw</figcaption>   
  </figure>    
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D1_2.gif" alt="D1" />     
    <figcaption style="text-align: center;">D1</figcaption>   
  </figure>    
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D2_2.gif" alt="D2" />     
    <figcaption style="text-align: center;">D2</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D3_2.gif" alt="D3" />     
    <figcaption style="text-align: center;">D3</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D4_2.gif" alt="D4" />     
    <figcaption style="text-align: center;">D4</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D5_2.gif" alt="D5" />     
    <figcaption style="text-align: center;">D5</figcaption>   
  </figure>
  <figure style="display: inline-block; width: 13%; height: auto">     
    <img src="../assets/posts_images/D6_2.gif" alt="D6" />     
    <figcaption style="text-align: center;">D6</figcaption>   
  </figure>
</div>

<h3 id="flops-reduction">FLOPs Reduction</h3>

<div style="text-align: center;">
      <img src="../assets/posts_images/DiTFastAttn_overview.jpg" alt="DiTFastAttn_overview" style="width: 50%; height: auto; min-width: 300px; display: block; margin-left: auto; margin-right: auto;" />
  </div>

<p>You can find the code for DiTFastAttn on GitHub at <a href="https://github.com/thu-nics/DiTFastAttn">DiTFastAttention</a>. Feel free to check out the repository for more details and to access the code.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Code Paper]]></summary></entry><entry><title type="html">An Introduction to Quantization of Large Language Models</title><link href="https://nics-effalg.com/QLLMIntro" rel="alternate" type="text/html" title="An Introduction to Quantization of Large Language Models" /><published>2023-08-30T00:00:00+00:00</published><updated>2023-08-30T00:00:00+00:00</updated><id>https://nics-effalg.com/QLLMIntro</id><content type="html" xml:base="https://nics-effalg.com/QLLMIntro"><![CDATA[<p><span>
    <a class="custom_buttom" href="https://www.bilibili.com/video/BV1zm4y1u72W/">
    Video
    </a>
    <a class="custom_buttom" href="../assets/ppt/2023-08-30-QLLMIntro.pdf">
    Slides
    </a>
</span></p>

<div style="flex: 1; flex-direction: column; padding: 20px;" class="project_card">
    The talk is in Chinese, and the English-version video will soon be available.
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Video Slides]]></summary></entry><entry><title type="html">Model Compression Towards Efficient Deep Learning Inference</title><link href="https://nics-effalg.com/Compression" rel="alternate" type="text/html" title="Model Compression Towards Efficient Deep Learning Inference" /><published>2023-08-29T00:00:00+00:00</published><updated>2023-08-29T00:00:00+00:00</updated><id>https://nics-effalg.com/Compression</id><content type="html" xml:base="https://nics-effalg.com/Compression"><![CDATA[<p><span>
    <a class="custom_buttom" href="../assets/ppt/2023-08-29-Compression.pdf">
    Slides
    </a>
</span></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Slides]]></summary></entry><entry><title type="html">Neural Architecture Search and Architecture Encoding</title><link href="https://nics-effalg.com/NAS" rel="alternate" type="text/html" title="Neural Architecture Search and Architecture Encoding" /><published>2022-12-12T00:00:00+00:00</published><updated>2022-12-12T00:00:00+00:00</updated><id>https://nics-effalg.com/NAS</id><content type="html" xml:base="https://nics-effalg.com/NAS"><![CDATA[<p><span>
    <a class="custom_buttom" href="../assets/ppt/2022-12-12-NAS.pdf">
    Slides
    </a>
</span></p>

<div style="flex: 1; flex-direction: column; padding: 20px;" class="project_card">
    This talk includes three parts:
    <ul>
      <li>First, I introduce the basics of neural architecture search (NAS), including background, motivation, problem definition.</li>
      <li>Then, I give an overview of NAS researches, organized by research questions, i.e., “what to search”, “how to search”, “what does the search tell us”. I summarize existing solutions to solve these questions.</li>
      <li>Finally, I introduce my researches that utilize architecture encoding to answer “how to search” along two directions, i.e., accelerate exploration and improve evaluation.</li>
    </ul>
    The talk is in Chinese, and the English-version video will soon be available.
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Slides]]></summary></entry></feed>