**Sakuna Harinda** (`Sakuna`):
Hi, what is the difference between instruction fine-tuning and instruction pre-training (in terms of training) as discussed in the paper, apart from the fact that in instruction fine-tuning we normally use parameter-efficient techniques like LoRA to update only a portion of the parameters?
**instruction-pretrain**:
Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as vanilla pre-training. In our experiments with instruction tuning, we tune all the parameters, but I think PEFT methods would also be applicable!
**Sakuna Harinda** (`Sakuna`):
Thanks for the reply. So the training procedure of instruction tuning and instruction pre-training is the same, right?
**Daixuan Cheng** (`daixuancheng`):
Hi, since our focus is on the pre-training stage rather than the tuning stage, we keep the tuning settings consistent with previous works. We compute the loss only on the output response part of each instruction-response pair. Additionally, the learning rate during the tuning stage is much smaller than during pre-training; for example, we use a learning rate of 5e-6 for tuning and 3e-4 for pre-training. The other tuning settings are the same as for pre-training (as shown in Table 10 in the Appendix).
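Computing the loss "only on the output response part" is usually implemented by masking the instruction tokens out of the labels. A minimal sketch of that convention (not the authors' code; the token ids are toy values, and `-100` is the ignore index used by common training frameworks):

```python
# Labels for an instruction-response pair: instruction tokens are set to an
# ignore index so the cross-entropy loss skips them and only the response
# tokens contribute to the tuning loss.
IGNORE_INDEX = -100  # conventional ignore index in cross-entropy losses

def build_labels(instruction_ids, response_ids):
    """Concatenate the pair into one input; mask the instruction in the labels."""
    input_ids = list(instruction_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(response_ids)
    return input_ids, labels

input_ids, labels = build_labels([11, 12, 13], [21, 22])
# input_ids -> [11, 12, 13, 21, 22]
# labels    -> [-100, -100, -100, 21, 22]
```

During pre-training, by contrast, the labels would simply be the full input sequence, so every token carries loss.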
**Huu Nguyen** (`huu-ontocord`):
Hi, is the dataset you released the 200M one?

**instruction-pretrain**:
Hi, this is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too large.
**Huu Nguyen** (`huu-ontocord`):
The 200M would be awesome to upload, preferably under an open license like CC-BY. It shouldn't be too much as shards of JSONL or Parquet. For the non-instruction pre-training data, if it's just public datasets, you can simply point people to the recipe, IMO.
**AdaptLLM**:
Thanks for the suggestion; we will consider open-sourcing the necessary parts.
**Imam Nur Bani Yusuf** (`imamnurby`):
Hi, thanks for your work! I got a few interesting insights from this.

1. Based on your results, can I say that we can replace "pre-train on the raw corpora → instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning and your proposed instruction pre-training, except that in your approach you train directly on the instruction-response pairs.
2. Do you perform any verification of the generated instruction-response pairs?
**instruction-pretrain**:
Hi, thanks for your questions!

**Q1: Can we replace "pre-train on the raw corpora → instruction tuning" with "instruction tuning" directly?**

This is a promising approach worth trying. However, it may come with two limitations:

- **Lack of a knowledge source:** Instruction Pre-Training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of raw text and synthesized pairs, formatted as context-based task completion (e.g., reading comprehension), with the aim of learning the knowledge embedded in the raw text. Vanilla instruction tuning (without the raw text) tends to teach the pre-trained base model to follow instructions rather than to learn new knowledge. In short: vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
- **Data limitation:** As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.

**Q2: Do you perform any verification of the generated instruction-response pairs?**

Yes, in Section 5 of our paper we check the synthesized pairs in terms of context relevance, response accuracy, and task diversity.
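The three-way contrast above can be sketched as data formats. This is an illustrative sketch only; the field names and templates are assumptions, not the paper's exact prompt format:

```python
def vanilla_pretrain_example(raw_text):
    # Vanilla pre-training: train on the raw text alone.
    return raw_text

def instruction_tuning_example(pair):
    # Instruction tuning: train on instruction-response pairs without context.
    return f"Instruction: {pair['instruction']}\nResponse: {pair['response']}"

def instruction_pretrain_example(raw_text, pairs):
    # Instruction pre-training: keep the raw text as context and append its
    # synthesized pairs, so the tasks are grounded in the text's knowledge.
    qa = "\n\n".join(instruction_tuning_example(p) for p in pairs)
    return f"{raw_text}\n\n{qa}"

pairs = [{"instruction": "What is X?", "response": "X is ..."}]
example = instruction_pretrain_example("Some raw domain text about X.", pairs)
```

The key point is that the instruction-augmented example still contains the raw text, so pre-training on it does not lose the knowledge source.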
**Phy** (`s-JoL`):
In Table 3, the result of Instruct-PT on MED is 61.3 and on FIN is 74.7, but in Table 4 the results are reversed. Is this a mistake?
**instruction-pretrain**:
Thanks for your careful review. The domain names in Table 4 should be reversed.
**Sam McLeod** (`smcleod`):
Have you given any thought to training a version on a higher ratio of code?

I am interested in finding a way to generate more up-to-date datasets on code libraries, examples, etc., as many of today's LLMs rely on quite dated knowledge and thus often generate code with deprecated libraries and patterns. I'm wondering if a more code-tuned version of this might do the trick.
**instruction-pretrain**:
Hi, thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs from new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.
**Le Van Duc** (`levanduc`):
Can you share the configurations, computation, and time for instruction-response pair generation on each raw dataset, as well as for the instruction pre-training? It would help us work with new domains of data. Thank you!
**instruction-pretrain**:
Thanks for the question. We've added pre-training suggestions in the Advanced Usage section of our [instruction synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer).

Using the vLLM inference code, on a single A100-80GB GPU it takes about 1 day to synthesize instruction-response pairs for 1 billion tokens of raw corpora.

For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens).

Other training details are presented in Table 10 in the Appendix of our paper.
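A 1:1 mix counted by tokens (rather than by number of examples) can be sketched as follows. This is an illustrative assumption of how such mixing might be done, not the released pipeline, and the whitespace split stands in for a real tokenizer:

```python
def token_count(example):
    # Stand-in tokenizer: whitespace split. A real pipeline would use the
    # model's tokenizer here.
    return len(example.split())

def mix_one_to_one(augmented, general):
    """Append general-instruction examples until their token count
    reaches the token count of the instruction-augmented corpus."""
    budget = sum(token_count(e) for e in augmented)
    mixed, used = list(augmented), 0
    for example in general:
        if used >= budget:
            break
        mixed.append(example)
        used += token_count(example)
    return mixed

mixed = mix_one_to_one(
    ["a b c d", "e f"],                                  # 6 tokens of augmented corpora
    ["one two three", "four five six", "seven eight"],   # general instructions
)
# mixed keeps both augmented examples plus general examples totaling ~6 tokens
```

Counting by tokens matters because instruction-augmented examples (raw text plus synthesized pairs) are typically much longer than individual general instructions, so an example-count ratio would under-weight the general data.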
**Dmitrii Stoianov** (`heylimon`):
Hi! From what I understand, you train the model by computing the loss on all tokens. Have you tried computing the loss only on the answers, as is commonly done during the alignment SFT phase? Have you conducted any ablation studies on this? Or do you have any insight into why it might be better to compute the loss on all tokens?
**instruction-pretrain**:
Hi, thanks for the question. We have not tried computing the loss only on the answers, for two reasons:

- **Keeping it simple:** We want our method to be very easy for everyone (including us) to use. By computing the loss on all tokens, we can simply convert the data and train on it like vanilla pre-training, with familiar training setups.
- **Fair comparison:** If we only compute the loss on the answers, the number of tokens the model learns from is much smaller. So if we run vanilla pre-training and Instruction Pre-Training for the same number of steps with the same batch size, Instruction Pre-Training ends up with fewer trained tokens, making the comparison unfair.

Anyway, it's a promising idea and worth trying, since it has worked well in instruction tuning.
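The fair-comparison point is quantitative: answer-only loss shrinks the number of loss-bearing tokens per sequence, so matching steps and batch size no longer matches trained tokens. A tiny sketch with illustrative numbers (the sequence and answer lengths are assumptions, not values from the paper):

```python
def loss_bearing_tokens(seq_len, answer_len, answers_only):
    # With all-token loss, every position in the sequence contributes;
    # with answer-only loss, only the answer span does.
    return answer_len if answers_only else seq_len

all_tokens = loss_bearing_tokens(seq_len=1024, answer_len=128, answers_only=False)
answer_only = loss_bearing_tokens(seq_len=1024, answer_len=128, answers_only=True)
ratio = all_tokens / answer_only  # 8x fewer trained tokens per sequence
```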
**Isingh** (`iqbalamo93`):
Thanks for the first response; it clarifies the doubt. I guess the confusion stems from the line in the paper that says, "Additionally, we calculate the tuning loss only on the instruction-response pairs to guide the model to focus on these pairs." Can you please help me understand whether this tuning loss is part of the pre-training stage? Where does this tuning loss fit in?
\n","updatedAt":"2024-08-29T20:15:59.541Z","author":{"_id":"630d00c43dc31beba6ef1e05","avatarUrl":"/avatars/d336fd693cfd9e7d4864ee51e664e619.svg","fullname":"Isingh","name":"iqbalamo93","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.921776533126831},"editors":["iqbalamo93"],"editorAvatarUrls":["/avatars/d336fd693cfd9e7d4864ee51e664e619.svg"],"reactions":[],"isReport":false,"parentCommentId":"66ab819173f8601ad1c50114"}},{"id":"66d125f3d51528a0386244c1","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false},"createdAt":"2024-08-30T01:52:51.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"> Can you please help me understand if this tuning loss is a part of the pre-training stage? or where does this tuning loss fits in?\n\nHi, the tuning loss is for tuning the instruction synthesizer, not for the pre-training stage. Specifically, we first tune the instruction synthesizer (where the tuning loss fits in), which is then used to augment the pre-training corpora. Afterward, we use the augmented corpora to pre-train a language model.","html":"\n\nCan you please help me understand if this tuning loss is a part of the pre-training stage? or where does this tuning loss fits in?
\n
Hi, the tuning loss is for tuning the instruction synthesizer, not for the pre-training stage. Specifically, we first tune the instruction synthesizer (where the tuning loss fits in), which is then used to augment the pre-training corpora. Afterward, we use the augmented corpora to pre-train a language model.
\n","updatedAt":"2024-08-30T01:52:51.325Z","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9006796479225159},"editors":["instruction-pretrain"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png"],"reactions":[{"reaction":"β€οΈ","users":["iqbalamo93"],"count":1}],"isReport":false,"parentCommentId":"66ab819173f8601ad1c50114"}}]},{"id":"66ab8cdef7c47d974ece613f","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false},"createdAt":"2024-08-01T13:25:50.000Z","type":"comment","data":{"edited":true,"hidden":true,"hiddenBy":"","latest":{"raw":"This comment has been hidden","html":"This comment has been hidden","updatedAt":"2024-08-01T13:27:12.831Z","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false}},"numEdits":0,"editors":[],"editorAvatarUrls":[],"reactions":[]}},{"id":"66d0352f85fb9cb99f101c73","author":{"_id":"631c418c71f8e7137df05f3e","avatarUrl":"/avatars/0157a31fb19b618fefa8e7c88227aeca.svg","fullname":"yeontaek oh 
","name":"yeontaek","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":7,"isUserFollowing":false},"createdAt":"2024-08-29T08:45:35.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hello, I reviewed the paper you suggested. Since I havenβt read through all the content of the paper, I am a bit confused.\n\nIn the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.\n\nHowever, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.\n\nIs there a difference between the two?","html":"Hello, I reviewed the paper you suggested. Since I havenβt read through all the content of the paper, I am a bit confused.
\nIn the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.
\nHowever, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.
\nIs there a difference between the two?
\n","updatedAt":"2024-08-29T08:45:35.490Z","author":{"_id":"631c418c71f8e7137df05f3e","avatarUrl":"/avatars/0157a31fb19b618fefa8e7c88227aeca.svg","fullname":"yeontaek oh ","name":"yeontaek","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":7,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9442613124847412},"editors":["yeontaek"],"editorAvatarUrls":["/avatars/0157a31fb19b618fefa8e7c88227aeca.svg"],"reactions":[],"isReport":false},"replies":[{"id":"66d03cc8ed0167bec30807bf","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false},"createdAt":"2024-08-29T09:18:00.000Z","type":"comment","data":{"edited":true,"hidden":false,"latest":{"raw":"Thanks for your question.\n\n**Q1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.**\n\nBOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.\n\nFollowing GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are `TEXT_IDs_N`:\n\nAn input sequence for pre-training would look like this:\n`bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ... 
, eos_token_id, TEXT_IDs_N, eos_token_id`\n\n**Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.**\n\nWhen fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we donβt need to add BOS and EOS token IDs during tokenization in fine-tuning.\n\nOverall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.\n","html":"Thanks for your question.
\nQ1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.
\nBOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.
\nFollowing GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are TEXT_IDs_N:
An input sequence for pre-training would look like this:bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ... , eos_token_id, TEXT_IDs_N, eos_token_id
Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.
\nWhen fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we donβt need to add BOS and EOS token IDs during tokenization in fine-tuning.
\nOverall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.
\n","updatedAt":"2024-08-29T09:41:58.092Z","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false}},"numEdits":2,"identifiedLanguage":{"language":"en","probability":0.9180124402046204},"editors":["instruction-pretrain"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png"],"reactions":[],"isReport":false,"parentCommentId":"66d0352f85fb9cb99f101c73"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2406.14491","authors":[{"_id":"6674ef905f7d5c8af70b5607","user":{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},"name":"Daixuan Cheng","status":"claimed_verified","statusLastChangedAt":"2024-06-21T07:05:43.020Z","hidden":false},{"_id":"6674ef905f7d5c8af70b5608","user":{"_id":"624ac662102fcdff87be51b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624ac662102fcdff87be51b9/rzNahZFFkp194170tactJ.jpeg","isPro":false,"fullname":"Yuxian Gu","user":"t1101675","type":"user"},"name":"Yuxian Gu","status":"admin_assigned","statusLastChangedAt":"2024-06-21T11:12:42.585Z","hidden":false},{"_id":"6674ef905f7d5c8af70b5609","user":{"_id":"632bd2f72d6a805eeb4bc601","avatarUrl":"/avatars/6e1533e8a599f3068290aa69ac82cab7.svg","isPro":false,"fullname":"HUANG SHAOHAN","user":"buaahsh","type":"user"},"name":"Shaohan Huang","status":"admin_assigned","statusLastChangedAt":"2024-06-21T11:12:55.699Z","hidden":false},{"_id":"6674ef905f7d5c8af70b560a","name":"Junyu 
Bi","hidden":false},{"_id":"6674ef905f7d5c8af70b560b","name":"Minlie Huang","hidden":false},{"_id":"6674ef905f7d5c8af70b560c","user":{"_id":"6368c512fbfe97c16a40baba","avatarUrl":"/avatars/1c23bc7c0b6d9225699ce27647623d7a.svg","isPro":false,"fullname":"Furu Wei","user":"thegenerality","type":"user"},"name":"Furu Wei","status":"admin_assigned","statusLastChangedAt":"2024-06-21T11:13:45.390Z","hidden":false}],"publishedAt":"2024-06-20T16:55:33.000Z","submittedOnDailyAt":"2024-06-21T01:42:56.560Z","title":"Instruction Pre-Training: Language Models are Supervised Multitask\n Learners","submittedOnDailyBy":{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},"summary":"Unsupervised multitask pre-training has been the critical method behind the\nrecent success of language models (LMs). However, supervised multitask learning\nstill holds significant promise, as scaling it in the post-training stage\ntrends towards better generalization. In this paper, we explore supervised\nmultitask pre-training by proposing Instruction Pre-Training, a framework that\nscalably augments massive raw corpora with instruction-response pairs to\npre-train LMs. The instruction-response pairs are generated by an efficient\ninstruction synthesizer built on open-source models. In our experiments, we\nsynthesize 200M instruction-response pairs covering 40+ task categories to\nverify the effectiveness of Instruction Pre-Training. In pre-training from\nscratch, Instruction Pre-Training not only consistently enhances pre-trained\nbase models but also benefits more from further instruction tuning. In\ncontinual pre-training, Instruction Pre-Training enables Llama3-8B to be\ncomparable to or even outperform Llama3-70B. 
Our model, code, and data are\navailable at https://github.com/microsoft/LMOps.","upvotes":96,"discussionId":"6674ef915f7d5c8af70b566d","projectPage":"https://huggingface.co/instruction-pretrain","githubRepo":"https://github.com/microsoft/LMOps","githubRepoAddedBy":"user","ai_summary":"Instruction Pre-Training enhances language models by generating and incorporating instruction-response pairs into unsupervised multitask pre-training.","ai_keywords":["Instruction Pre-Training","instruction-response pairs","instruction synthesizer","open-source models","pre-training","continual pre-training","Llama3-8B","Llama3-70B"],"githubStars":4284},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62b425eb21218c81984c9a92","avatarUrl":"/avatars/e7aafaaf7600b6696c1229f07cd24011.svg","isPro":false,"fullname":"Oliver Pfaffel","user":"OliP","type":"user"},{"_id":"5dfb412cda6d0311fd3d5437","avatarUrl":"/avatars/b7783c2c66480613a4c46abafb25eae7.svg","isPro":false,"fullname":"Gaurish Thakkar","user":"thak123","type":"user"},{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},{"_id":"650801ced5578ef7e20b33d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650801ced5578ef7e20b33d4/oLptSnKMecbu62EgglmO6.png","isPro":false,"fullname":"AdaptLLM","user":"AdaptLLM","type":"user"},{"_id":"6531ea10cd5377e9ade3df30","avatarUrl":"/avatars/504233529585b8b03544746e9d441f98.svg","isPro":false,"fullname":"Youhui Bai","user":"Bert0108","type":"user"},{"_id":"639c379cdb7c5f35004066cb","avatarUrl":"/avatars/3e435506ee85aa7d2d0ec2174a07462f.svg","isPro":false,"fullname":"Zhenran 
Xu","user":"imryanxu","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"632bd2f72d6a805eeb4bc601","avatarUrl":"/avatars/6e1533e8a599f3068290aa69ac82cab7.svg","isPro":false,"fullname":"HUANG SHAOHAN","user":"buaahsh","type":"user"},{"_id":"60107b385ac3e86b3ea4fc34","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1627505688463-60107b385ac3e86b3ea4fc34.jpeg","isPro":true,"fullname":"Daniel van Strien","user":"davanstrien","type":"user"},{"_id":"63732ebbbd81fae2b3aaf3fb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1669551186189-63732ebbbd81fae2b3aaf3fb.jpeg","isPro":false,"fullname":"Knut JΓ€gersberg","user":"KnutJaegersberg","type":"user"},{"_id":"60f2fc91b92afccb7c34b8ed","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60f2fc91b92afccb7c34b8ed/W2-Nay12Ef4Ltyaf8EKE9.jpeg","isPro":true,"fullname":"Gabriel MartΓn BlΓ‘zquez","user":"gabrielmbmb","type":"user"},{"_id":"63a369d98c0c89dcae3b8329","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63a369d98c0c89dcae3b8329/AiH2zjy1cnt9OADAAZMLD.jpeg","isPro":false,"fullname":"Adina Yakefu","user":"AdinaY","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">Instruction Pre-Training: Language Models are Supervised Multitask Learners
Abstract
Instruction Pre-Training enhances language models by generating and incorporating instruction-response pairs into unsupervised multitask pre-training.
Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
Community
🤗 We share our data and models with example usages; feel free to open any issues or discussions! 🤗
- Thanks to the demo davanstrien/instruction-synthesizer for implementing our approach
- Context-Based Instruction Synthesizer: instruction-synthesizer
- Fine-Tuning Data for the Synthesizer: ft-instruction-synthesizer-collection
- General Models Pre-Trained from Scratch (on 100B tokens):
- Domain-Specific Models Pre-Trained from Llama3-8B:
- General Instruction-Augmented Corpora: general-instruction-augmented-corpora
- Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): medicine-instruction-augmented-corpora
Hi, what is the difference between instruction fine-tuning and instruction pre-training (in terms of training) discussed in the paper, aside from the fact that in IFT we normally use parameter-efficient techniques like LoRA to update only a portion of the parameters?
Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as Vanilla Pre-Training. In our experiment with instruction tuning, we tune all the parameters, but I think the PEFT method would also be applicable!
Hi,
This is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too large.
Hi, thanks for your work! I got a few interesting insights from this.
Based on your results, can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning and your proposed instruction pre-training, except that in your approach, you directly train on the instruction-response pairs.
Do you perform any verification of the generated instruction-response pairs?
Hi,
Thanks for your question!
Q1: Can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly?
This is a promising approach worth trying. However, it may come with two limitations:
- Lack of Knowledge Source: Instruction Pre-Training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of raw text and synthesized pairs, formatted as context-based task completion (e.g., reading comprehension), so that the model learns the knowledge embedded in the raw text. Vanilla instruction tuning (without the raw text) tends to teach the pre-trained base model to follow instructions rather than to learn new knowledge.
For example, in the following image, vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
- Data Limitation: As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.
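To make the data-format distinction concrete, here is a minimal sketch of how an instruction-augmented text could be assembled; the template strings and helper name are my own illustration, not the paper's exact format:

```python
def build_instruction_augmented_text(raw_text, qa_pairs):
    """Concatenate a raw text with its synthesized instruction-response
    pairs, so the LM trains on both the knowledge (raw text) and the
    context-based tasks (pairs) in a single training example."""
    parts = [raw_text]
    for instruction, response in qa_pairs:
        parts.append(f"Question: {instruction}")
        parts.append(f"Answer: {response}")
    return "\n".join(parts)

raw = "The mitochondrion is the powerhouse of the cell."
pairs = [("What is the powerhouse of the cell?", "The mitochondrion.")]
example = build_instruction_augmented_text(raw, pairs)
```

Vanilla pre-training would train on `raw` alone, and vanilla instruction tuning on `pairs` alone; instruction pre-training trains on `example`.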
Q2: Do you perform any verifications of the generated instruction-response pairs?
Yes, in section 5 of our paper, we have checked the synthesized pairs in terms of context relevance, response accuracy, and task diversity.
In Table 3, the result of instruct-pt on MED is 61.3 and fin is 74.7, but in Table 4, the results are reversed. Is this a mistake?
Thanks for your careful review. The domain names in Table 4 should be reversed.
Have you given any thought to training a version on a higher ratio of code?
I am interested in finding a way to generate more up-to-date datasets on code libraries, examples, etc., as many of today's LLMs rely on quite dated knowledge and thus often generate code with deprecated libraries and patterns. Just wondering if a more code-tuned version of this might do the trick.
Hi,
Thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs based on new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.
Can you share the configurations, computation, and time for instruction-response pair generation for each raw dataset, as well as for the instruction pre-training? It would help us work with datasets from new domains. Thank you!
Thanks for the question. We've added pre-training suggestions in the Advanced Usage section of our instruction synthesizer.
Using the vLLM inference code, on a single A100-80GB GPU, it takes about 1 day to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from OpenOrca at a 1:1 ratio (counted by tokens).
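As a rough planning helper based on the two figures above (the throughput and mixing ratio are from this thread; the function names are my own):

```python
def synthesis_days(raw_tokens, tokens_per_day=1_000_000_000):
    """Estimated wall-clock days to synthesize instruction-response pairs:
    ~1 day per 1B raw-corpus tokens on a single A100-80GB with vLLM,
    per the numbers reported in this thread. Scale for your hardware."""
    return raw_tokens / tokens_per_day

def general_tokens_to_mix(domain_tokens, ratio=1.0):
    """Tokens of general instructions (e.g., from OpenOrca) to mix with an
    instruction-augmented domain corpus at a 1:ratio token ratio."""
    return int(domain_tokens * ratio)
```

For example, a 5B-token domain corpus would take roughly five A100-days to augment, and the recommended 1:1 mix would add an equal number of general-instruction tokens.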
Other training details are presented in Table 10 in the Appendix of our paper:
Hi! From what I understand, you train the model by computing the loss on all tokens. Have you tried training the model by computing the loss only on the answers, as is commonly done during the Alignment SFT phase? Have you conducted any ablation studies on this? Or do you have any insights on why it might be better to compute the loss on all tokens?
Hi, thanks for the question. We have not tried computing the loss only on the answers yet for two reasons:
- Keeping It Simple: We want our method to be very easy for everyone (including us) to use. By computing the loss on all tokens, we can just convert the data and train it like Vanilla Pre-training with familiar training setups.
- Fair Comparison: If we only compute the loss on the answers, the number of tokens the model learns from is much smaller. So, if we run Vanilla Pre-training and Instruction Pre-training for the same number of steps with the same batch size, Instruction Pre-training ends up with fewer trained tokens, making it an unfair comparison.
Anyway, it's a promising idea and worth trying since it has worked well in instruction tuning.
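The two loss variants being discussed differ only in how the labels are built. A minimal sketch, using the common convention of an ignore index to mask out tokens (the helper name and convention are mine, not from the paper's code):

```python
IGNORE_INDEX = -100  # conventional "skip this position" label in many trainers

def make_labels(token_ids, answer_mask, answers_only=False):
    """Build next-token-prediction labels for a packed sequence.

    answers_only=False -> loss on all tokens (what the paper does);
    answers_only=True  -> loss only on response tokens (common in SFT).
    answer_mask[i] is True when token i belongs to a response span.
    """
    if not answers_only:
        return list(token_ids)
    return [t if m else IGNORE_INDEX for t, m in zip(token_ids, answer_mask)]
```

With `answers_only=True`, only positions whose label is not `IGNORE_INDEX` contribute to the cross-entropy, which is why the per-step trained-token count drops.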
Hello, I reviewed the paper you suggested. Since I haven't read through all the content of the paper, I am a bit confused.
In the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.
However, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.
Is there a difference between the two?
Thanks for your question.
Q1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.
BOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.
Following GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are `TEXT_IDs_N`. An input sequence for pre-training would look like this:
`bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ..., eos_token_id, TEXT_IDs_N, eos_token_id`
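A simple sketch of that packing scheme (the placeholder token IDs and the no-splitting policy are assumptions for illustration; the real IDs come from the tokenizer, and real pipelines may split texts across sequence boundaries):

```python
BOS, EOS = 1, 2  # placeholder IDs; use tokenizer.bos_token_id / eos_token_id

def pack_sequences(texts_token_ids, max_len):
    """Pack tokenized instruction-augmented texts into pre-training
    sequences of the form: BOS, TEXT_1, EOS, TEXT_2, EOS, ...
    Starts a new sequence when the next text would overflow max_len."""
    sequences, current = [], [BOS]
    for ids in texts_token_ids:
        piece = ids + [EOS]
        if len(current) + len(piece) > max_len and len(current) > 1:
            sequences.append(current)
            current = [BOS]
        current.extend(piece)
    if len(current) > 1:
        sequences.append(current)
    return sequences
```

Each packed sequence then trains like any other pre-training sequence, with BOS/EOS added automatically at tokenization time as described above.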
Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.
When fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we don't need to add BOS and EOS token IDs during tokenization in fine-tuning.
Overall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.