Melnyk","user":"imelnyk","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">Abstract
Representation Finetuning (ReFT) methods, exemplified by Low-rank Linear Subspace ReFT (LoReFT), achieve high efficiency and performance by adapting representations in frozen base models, outperforming state-of-the-art Parameter-efficient Fine-tuning (PEFT) methods.
Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. Here, we pursue this hypothesis by developing a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). LoReFT is a drop-in replacement for existing PEFTs and learns interventions that are 10x-50x more parameter-efficient than prior state-of-the-art PEFTs. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, Alpaca-Eval v1.0, and GLUE. In all these evaluations, LoReFT delivers the best balance of efficiency and performance, and almost always outperforms state-of-the-art PEFTs. We release a generic ReFT training library publicly at https://github.com/stanfordnlp/pyreft.
Community
As a developer, the key takeaway for me is: 7% greater accuracy, 27x fewer parameters, and 18 minutes to train a 7B model that can compete with GPT-3.5. That's a nice step up in performance and efficiency.
But I'm totally confused as to how this actually works. Here's what I think I understand.
- During finetuning, we "simply" modify the inference output via a type of mask/filter.
- This mask is a type of contextualised embedding.
- During inference, we simply pass the original output through our new learned mask.
Am I right in my (abstract) understanding?
If so, there should be zero parameters modified... hence my confusion.
Thanks for your comments!
PEFTs update model subcomponents (e.g., layer weight diffs), add new components (e.g., adapters), or tune some embeddings (e.g., prefix embeddings).
So what ReFT does is train interventions that act on the representations, in the following steps:
- collect representations using hooks (as callback functions).
- learn a transformation function f that is applied to those representations.
- put the transformed representations back into the computation graph.
The learnable parameters are in the function f. We provide one way to parameterize f in the paper, which we call LoReFT, but you can design your own transformation function.
I guess if you frame it as doing ReFT on the output logits, then sure, it is sort of a logits edit, but I am not totally sure that is the best way to unify things.
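As a rough illustration of these three steps (a minimal sketch under my own assumptions, not the pyreft API): the module below implements the LoReFT edit h + R^T(Wh + b - Rh) described in the paper, and `attach_reft_hook` is a hypothetical helper that registers it as a forward hook on a single transformer block.

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """LoReFT-style edit: h <- h + R^T (W h + b - R h).

    R (rank x dim) projects into a low-rank subspace; only the component of h
    inside that subspace is changed. The paper constrains R to have orthonormal
    rows; here it is left as a free parameter for brevity.
    """

    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.randn(rank, hidden_dim) / hidden_dim ** 0.5)
        self.W = nn.Linear(hidden_dim, rank)  # provides W and the bias b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        delta = self.W(h) - h @ self.R.T  # (..., rank): target minus current subspace coords
        return h + delta @ self.R         # map the low-rank edit back to hidden_dim


def attach_reft_hook(block: nn.Module, intervention: nn.Module, positions):
    """Register a forward hook that rewrites the block output at `positions`.

    The base model stays frozen; only `intervention` holds trainable weights.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, positions, :] = intervention(hidden[:, positions, :])
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return block.register_forward_hook(hook)
```

A real training setup additionally has to handle position selection across a batch, optional weight sharing across layers and positions, and saving/loading; the sketch only shows the collect, transform, and put-back loop.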
This paper looks really promising, but I am having a hard time understanding what "representation" means. I didn't find a definition in the paper either.
What comes closest to a definition is at the beginning of chapter 3, where representation seems to be a synonym for embedding (input tokens x_1,...,x_n are translated to representations h_1,...,h_n).
This is in line with MichaelBarryUK's comment. The author (zhengxuanzenwu) replies saying representations are
1. model subcomponents, or
2. new components, or
3. Prefix-tune.
Unfortunately, I don't understand 1, 2, and 3 either.
Hey! Thanks for the question.
Yes, the representations we intervene on are h_1,...,h_n. These are the residual streams, or block outputs, at each layer and each token position.
I also updated my previous answer to try to be clearer. Let me know if this makes sense, though.
Thanks a lot, that does help, though I am still struggling a bit with the term "representation". Could you please give a concise definition? That would be awesome.
Let me try to elaborate a little. Given an input sequence x_1,...,x_n with n tokens, each layer of the transformer computes a new hidden representation sequence h_1,...,h_n, with one representation per token. Our method intervenes on a subset of h_1,...,h_n, for instance the first two (h_1 and h_2) and the last two (h_{n-1} and h_n).
Thanks a lot. So a representation is the output of an attention layer or the MLP layer of the transformer!?
The representation we refer to in the paper is the output of a whole transformer block, so it is essentially the MLP layer output plus the residual stream. But it would also be interesting to do ReFT on OV (i.e., attention output) or MLP_out.
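To make the shapes concrete, here is a toy sketch (my own illustration with made-up dimensions, not code from the paper or pyreft): the block output carries one d-dimensional vector per token, and the edit touches only the chosen positions, e.g., the first two and last two, leaving the rest untouched.

```python
import torch

batch, seq_len, d = 1, 10, 16
hidden = torch.randn(batch, seq_len, d)  # block output: one d-dim vector per token position

# Intervene on the first two and last two positions only, as in the example above.
positions = [0, 1, seq_len - 2, seq_len - 1]

def f(h: torch.Tensor) -> torch.Tensor:
    # Stand-in for the learned transformation (e.g., a LoReFT edit).
    return h + 0.1 * torch.tanh(h)

hidden[:, positions, :] = f(hidden[:, positions, :])
# Representations at all other positions flow on unchanged.
```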
@zhengxuanzenwu This is a very interesting paper! Great work! In the results section, it looks like a lot of methods outperform LoReFT on the arithmetic reasoning tasks. Do you have a hypothesis for why this is? My initial thought is that maybe it's not as simple a task to capture mathematical representations in the output of the attention layers?
EDIT: In the paper, you say that the length of the generations might have something to do with this. Did you run any tests to see if the effectiveness of the method reduces as the length of generation increases?
@shamikbose89 Thanks for your interest! Yes, LoReFT underperforms for arithmetic reasoning tasks, especially for GSM8K. In short, we don't know how to fix it yet. But here are a couple of hypotheses:
- Hyperparameter selection is not optimal. Although we tried hyperparameter tuning, our grid search is still pretty limited. We also haven't tried layerwise intervention weight sharing, etc.
- Intervening on decoding steps might help. Currently, we only intervene on the prompt. It is surprising that this is sufficient for the other two tasks with LLaMA models. For math reasoning, which requires CoT generations, intervening on decoding steps might help with long-form reasoning (a rough sketch of this idea follows this reply).
- More complex parameterization of the intervention. LoReFT is just one way of defining the intervention function. Coming up with more complex interventions could help.
Offline, we also tried to train and test on GSM8K only (the GSM8K dataset is also cleaner, without GPT-4-generated CoTs). LoReFT with Llama-2 still slightly underperforms LoRA with Llama-2 7B (approximately 32% vs. 35%). However, LoReFT has far fewer trainable parameters. See the LoftQ paper for Llama-2 performance on GSM8K.
Re EDIT: Here, the generation length is shorter, since the gold labels from GSM8K are shorter than GPT-4-generated CoTs. Yeah, it could be interesting to look into this, since 32% vs. 35% is a smaller gap.
P.S. If you want to improve the math reasoning ability of ReFTs, I would also recommend using the GSM8K setup from the LoftQ paper. It is just much cleaner than the LLM-Adapters setup; we used the latter for the sake of benchmarking only.
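For what it is worth, the second hypothesis above (intervening on decoding steps) could be prototyped with a hook variant that also edits each newly generated token's representation. This is only a sketch of the idea under the same assumptions as the earlier `attach_reft_hook` sketch, not something the paper evaluates.

```python
import torch.nn as nn

def attach_decoding_reft_hook(block: nn.Module, intervention: nn.Module):
    """Hook variant that edits the last position of every forward pass.

    During prefill the last position is the final prompt token; during
    incremental decoding each call carries exactly one new token, so the
    intervention now fires at every generation step (unlike the prompt-only
    setup, this does add per-token overhead).
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1:, :] = intervention(hidden[:, -1:, :])
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return block.register_forward_hook(hook)
```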
@zhengxuanzenwu I am struggling to understand one aspect. I get the LoReFT equation, but after that I don't understand why you would apply it selectively to various positions. Why not apply it to all positions and make it a layer-based operation?
Also have you done any tests to compare how this impacts latency?
Awesome work, and thanks for sharing!!
@derek-thomas Thanks for your interest! We tried hyperparameter tuning over whether we share LoReFT weights across all prompt-token positions. It seems that once we go above a certain threshold (and this is task dependent), performance does not increase. Intuitively, this may suggest that editing every residual stream in the same way is not ideal (i.e., each position stores information differently). On the other hand, if we don't share weights across positions, the parameter count for LoReFT goes up.
We did a preliminary latency analysis in Appendix H.
One thing to note is that intervening only on the prompt tokens (i.e., only on the KV cache) is what makes the ReFT paradigm efficient. This is different from adapters, where every decoding step has overhead. It is also different from merging LoRA weights before serving, since ReFT allows dynamic, task-based interventions within a batch.
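To illustrate the prompt-only point (a sketch assuming a HuggingFace-style causal LM and the hypothetical hook handles from the earlier sketch; it is not the pyreft serving code): the hooks are active only for the prefill pass that fills the KV cache and are removed before decoding, so each generation step runs the frozen model with no extra compute.

```python
import torch

@torch.no_grad()
def generate_with_prompt_only_reft(model, tokenizer, prompt, hook_handles, max_new_tokens=64):
    """Intervene only while encoding the prompt, then decode normally."""
    inputs = tokenizer(prompt, return_tensors="pt")

    # 1) Prefill with interventions active: the edited representations are
    #    exactly what gets written into the KV cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values

    # 2) Remove the hooks: decoding runs the frozen base model untouched.
    for handle in hook_handles:
        handle.remove()

    # 3) Greedy decoding conditioned on the intervened cache.
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

    return tokenizer.decode(torch.cat(generated, dim=-1)[0])
```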