Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2

Abstract
T\"ULU 2, an advanced suite of language models, achieves state-of-the-art performance through improved datasets, fine-tuning techniques, and direct preference optimization, surpassing even GPT-3.5-turbo-0301 on several benchmarks.
Since the release of TÜLU [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. Concretely, we release: (1) TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2) TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple perspectives shows that the TÜLU 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data, training and evaluation code to facilitate future open efforts on adapting large language models.
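For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) that the abstract refers to. This is not the authors' training code; the `beta` value and the convention of passing summed per-token log-probs are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss. Each argument is a tensor of summed per-token
    log-probs for the chosen/rejected responses under the trainable
    policy or the frozen reference model. beta=0.1 is illustrative."""
    # Implicit rewards: scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```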
Community
I really love this kind of research! I also agree with the paper's view that future work should analyze the underlying factors behind the various datasets, training methods, and base models. I hope more research like this follows!
Great to see the Zephyr recipe being battle-tested on larger models!
Hello! @Muennighoff pointed out to me that your Zephyr (and Xwin) results on AlpacaEval differ from those on the public leaderboard. In particular, we reported a win rate of 90.60% for Zephyr, but your table has 86.3%.
Is this simply due to a different choice of generation parameters, i.e., did you use a different config from the one we added in the AlpacaEval repo (https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/models_configs/zephyr-7b-beta/configs.yaml)?
@lewtun Yeah, this is with greedy decoding rather than temperature 0.7; we used greedy decoding for all the models we tested. I'll also note that I think there can be some variation (around 1-2 points) just from rerunning the eval, probably due to non-determinism in the GPT annotations.
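For concreteness, the difference between the two setups comes down to the decoding arguments passed to `generate`. A minimal sketch with the `transformers` API, assuming `HuggingFaceH4/zephyr-7b-beta`; the prompt and token budget are illustrative, and neither eval harness is reproduced here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

inputs = tok("Explain DPO in one sentence.", return_tensors="pt")

# Greedy decoding, as used in the Tulu 2 evals per the comment above:
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Temperature sampling, as in the AlpacaEval leaderboard config (temperature 0.7):
sampled = model.generate(**inputs, do_sample=True, temperature=0.7,
                         max_new_tokens=256)
```

Greedy decoding is deterministic given the model and prompt, while sampling at temperature 0.7 adds response-level variance on top of the annotator non-determinism mentioned above, so a gap of a few points between the two setups is plausible.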