Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2

Abstract
T\"ULU 2, an advanced suite of language models, achieves state-of-the-art performance through improved datasets, fine-tuning techniques, and direct preference optimization, surpassing even GPT-3.5-turbo-0301 on several benchmarks.
Since the release of TÜLU [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. Concretely, we release: (1) TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2) TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple perspectives shows that the TÜLU 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data, training and evaluation code to facilitate future open efforts on adapting large language models.
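For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) that the abstract refers to. This is not the authors' training code; the `beta` value and the convention of passing summed per-token log-probs are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss. Each argument is a tensor of summed per-token
    log-probs for the chosen/rejected responses under the trainable
    policy or the frozen reference model. beta=0.1 is illustrative."""
    # Implicit rewards: scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```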
Community
I really love this kind of research! I also agree with the paper's view that future work should analyze the underlying factors behind the various datasets, training methods, and base models. I hope more research like this follows!
Great to see the Zephyr recipe being battle-tested on larger models!
Hello! @Muennighoff pointed out to me that your Zephyr (and Xwin) results on AlpacaEval differ from those on the public leaderboard. In particular, we reported a win rate of 90.60% for Zephyr, but your table has 86.3%.
Is this simply due to a different choice of generation parameters, i.e., did you use a different config from the one we added in the AlpacaEval repo (https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/models_configs/zephyr-7b-beta/configs.yaml)?
@lewtun Yeah, this is with greedy decoding rather than temperature 0.7; we used greedy decoding for all the models we tested. I'll also note that I think there can be some variation (around 1-2 points) just from rerunning the eval, probably due to non-determinism in the GPT annotations.
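For concreteness, the difference between the two setups comes down to the decoding arguments passed to `generate`. A minimal sketch with the `transformers` API, assuming `HuggingFaceH4/zephyr-7b-beta`; the prompt and token budget are illustrative, and neither eval harness is reproduced here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

inputs = tok("Explain DPO in one sentence.", return_tensors="pt")

# Greedy decoding, as used in the Tulu 2 evals per the comment above:
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Temperature sampling, as in the AlpacaEval leaderboard config (temperature 0.7):
sampled = model.generate(**inputs, do_sample=True, temperature=0.7,
                         max_new_tokens=256)
```

Greedy decoding is deterministic given the model and prompt, while sampling at temperature 0.7 adds response-level variance on top of the annotator non-determinism mentioned above, so a gap of a few points between the two setups is plausible.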