From PEFT to DEFT: Parameter Efficient Finetuning for Reducing
Activation Density in Transformers
\n","updatedAt":"2024-02-26T05:24:58.147Z","author":{"_id":"5f43448a79c1ba4c353d0d8f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5f43448a79c1ba4c353d0d8f/DiSygV3dn7A_OjmGVTrHD.jpeg","fullname":"Sugato Ray","name":"sugatoray","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":46,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6294185519218445},"editors":["sugatoray"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/5f43448a79c1ba4c353d0d8f/DiSygV3dn7A_OjmGVTrHD.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2402.01911","authors":[{"_id":"65dc1fc8ff744e508bd74332","user":{"_id":"60d4c068b102039f04db9021","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60d4c068b102039f04db9021/cHgd6_cI_7MxAjJ9iG1e1.jpeg","isPro":false,"fullname":"Bharat Runwal","user":"bharatR","type":"user"},"name":"Bharat Runwal","status":"extracted_confirmed","statusLastChangedAt":"2024-02-26T05:29:21.737Z","hidden":false},{"_id":"65dc1fc8ff744e508bd74333","name":"Tejaswini Pedapati","hidden":false},{"_id":"65dc1fc8ff744e508bd74334","name":"Pin-Yu Chen","hidden":false}],"publishedAt":"2024-02-02T21:25:46.000Z","title":"From PEFT to DEFT: Parameter Efficient Finetuning for Reducing\n Activation Density in Transformers","summary":"Pretrained Language Models (PLMs) have become the de facto starting point for\nfine-tuning on downstream tasks. However, as model sizes continue to increase,\ntraditional fine-tuning of all parameters becomes challenging. To address this,\nparameter-efficient fine-tuning (PEFT) methods have gained popularity as a\nmeans to adapt PLMs effectively. In parallel, recent studies have revealed the\npresence of activation sparsity within the intermediate outputs of the\nmultilayer perception (MLP) blocks in transformers. Low activation density\nenables efficient model inference on sparsity-aware hardware. Building upon\nthis insight, in this work, we propose a novel density loss that encourages\nhigher activation sparsity (equivalently, lower activation density) in the\npre-trained models. We demonstrate the effectiveness of our approach by\nutilizing mainstream PEFT techniques including QLoRA, LoRA, Adapter,\nPrompt/Prefix Tuning to facilitate efficient model adaptation across diverse\ndownstream tasks. Experiments show that our proposed method DEFT,\nDensity-Efficient Fine-Tuning, can reduce the activation density consistently\nand up to 50.72% on RoBERTa_Large, and 53.19% (encoder density) and 90.60% (decoder density) on\nFlan-T5_XXL (11B) compared to PEFT using GLUE and QA\n(SQuAD) benchmarks respectively while maintaining competitive performance on\ndownstream tasks. 
We also showcase that DEFT works complementary with quantized\nand pruned models","upvotes":2,"discussionId":"65dc1fc9ff744e508bd74479","githubRepo":"https://github.com/ibm/deft","githubRepoAddedBy":"auto","ai_summary":"A density-efficient fine-tuning method enhances pre-trained language models by encouraging activation sparsity, improving inference efficiency without significantly affecting performance.","ai_keywords":["pretrained language models","parameter-efficient fine-tuning","activation sparsity","multilayer perception","MLP","transformers","density loss","QLoRA","LoRA","Adapter","Prompt/Prefix Tuning","DEFT","density-efficient fine-tuning","RoBERTa","Flan-T5","GLUE","QA","SQuAD","quantized models","pruned models"],"githubStars":7},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"5f43448a79c1ba4c353d0d8f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5f43448a79c1ba4c353d0d8f/DiSygV3dn7A_OjmGVTrHD.jpeg","isPro":true,"fullname":"Sugato Ray","user":"sugatoray","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"}],"acceptLanguages":["*"]}">
AI-generated summary
A density-efficient fine-tuning method enhances pre-trained language models by encouraging activation sparsity, improving inference efficiency without significantly affecting performance.
Pretrained Language Models (PLMs) have become the de facto starting point for
fine-tuning on downstream tasks. However, as model sizes continue to increase,
traditional fine-tuning of all parameters becomes challenging. To address this,
parameter-efficient fine-tuning (PEFT) methods have gained popularity as a
means to adapt PLMs effectively. In parallel, recent studies have revealed the
presence of activation sparsity within the intermediate outputs of the
multilayer perceptron (MLP) blocks in transformers. Low activation density
enables efficient model inference on sparsity-aware hardware. Building upon
this insight, in this work, we propose a novel density loss that encourages
higher activation sparsity (equivalently, lower activation density) in the
pre-trained models. We demonstrate the effectiveness of our approach by
utilizing mainstream PEFT techniques, including QLoRA, LoRA, Adapter, and
Prompt/Prefix Tuning, to facilitate efficient model adaptation across diverse
downstream tasks. Experiments show that our proposed method DEFT,
Density-Efficient Fine-Tuning, consistently reduces activation density: by up to
50.72% on RoBERTa_Large, and by 53.19% (encoder density) and 90.60% (decoder
density) on Flan-T5_XXL (11B), compared to PEFT on the GLUE and QA (SQuAD)
benchmarks respectively, while maintaining competitive performance on
downstream tasks. We also show that DEFT is complementary to quantized and
pruned models.
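
To make the idea of a density loss concrete, here is a minimal sketch (not the authors' DEFT implementation) of the general pattern the abstract describes: capture the intermediate MLP activations with a forward hook and add a differentiable sparsity surrogate (here a simple L1 term) to the task loss, so fine-tuning is nudged toward lower activation density. The toy block, the regularization weight `lambda_density`, the threshold `tau`, and the MSE task loss are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class ToyMLPBlock(nn.Module):
    """Stand-in for a transformer MLP block: up-projection -> GELU -> down-projection."""

    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


def activation_density(h: torch.Tensor, tau: float = 0.0) -> float:
    """Fraction of intermediate activations whose magnitude exceeds tau (assumed metric)."""
    return (h.abs() > tau).float().mean().item()


model = ToyMLPBlock()
captured = []  # intermediate (post-activation) MLP outputs gathered by the hook

def hook(_module, _inputs, output):
    captured.append(output)

model.act.register_forward_hook(hook)

x = torch.randn(8, 16, 64)      # (batch, seq_len, d_model)
target = torch.randn(8, 16, 64)
lambda_density = 0.1            # regularization strength (assumed value)

out = model(x)
task_loss = nn.functional.mse_loss(out, target)  # stand-in for the downstream task loss
# Differentiable L1 surrogate: a smaller mean |activation| pushes more
# intermediate activations toward zero, i.e. lower activation density.
density_penalty = torch.stack([h.abs().mean() for h in captured]).mean()
total_loss = task_loss + lambda_density * density_penalty
total_loss.backward()

print(f"activation density (tau=0): {activation_density(captured[0]):.3f}")
```

In a PEFT setting, only the adapter parameters (e.g. LoRA matrices) would receive gradients from this combined objective, while the pretrained weights stay frozen; the hooks and the density term attach to the frozen MLP activations in the same way as in this sketch.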