**Sakuna Harinda** (`Sakuna`):
Hi, what is the difference between instruction fine-tuning and instruction pre-training (in terms of training) as discussed in the paper, apart from the fact that in instruction fine-tuning we normally use parameter-efficient techniques like LoRA to update only a portion of the parameters?
**instruction-pretrain**:
Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as vanilla pre-training. In our experiments with instruction tuning, we tune all the parameters, but I think PEFT methods would also be applicable!
**Sakuna Harinda** (`Sakuna`):
Thanks for the reply. So the training procedure of instruction tuning and instruction pre-training is the same, right?
**Daixuan Cheng** (`daixuancheng`):
Hi, since our focus is on the pre-training stage rather than the tuning stage, we keep the tuning settings consistent with previous works. We compute the loss only on the output response part of each instruction-response pair. Additionally, the learning rate during the tuning stage is much smaller than during pre-training; for example, we use a learning rate of 5e-6 for tuning and 3e-4 for pre-training. The other tuning settings are the same as for pre-training (as shown in Table 10 in the Appendix).
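Computing the loss "only on the output response part" is usually implemented by masking the instruction tokens out of the labels. A minimal sketch of that convention (not the authors' code; the token ids are toy values, and `-100` is the ignore index used by common training frameworks):

```python
# Labels for an instruction-response pair: instruction tokens are set to an
# ignore index so the cross-entropy loss skips them and only the response
# tokens contribute to the tuning loss.
IGNORE_INDEX = -100  # conventional ignore index in cross-entropy losses

def build_labels(instruction_ids, response_ids):
    """Concatenate the pair into one input; mask the instruction in the labels."""
    input_ids = list(instruction_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(response_ids)
    return input_ids, labels

input_ids, labels = build_labels([11, 12, 13], [21, 22])
# input_ids -> [11, 12, 13, 21, 22]
# labels    -> [-100, -100, -100, 21, 22]
```

During pre-training, by contrast, the labels would simply be the full input sequence, so every token carries loss.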
**Huu Nguyen** (`huu-ontocord`):
Hi, is the dataset you released the 200M one?

**instruction-pretrain**:
Hi, this is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too large.
**Huu Nguyen** (`huu-ontocord`):
The 200M would be awesome to upload, preferably under an open license like CC-BY. It shouldn't be too much as shards of JSONL or Parquet. For the non-instruction pre-training data, if it's just public datasets, you can simply point people to the recipe, IMO.
**AdaptLLM**:
Thanks for the suggestion; we will consider open-sourcing the necessary parts.
**Imam Nur Bani Yusuf** (`imamnurby`):
Hi, thanks for your work! I got a few interesting insights from this.

1. Based on your results, can I say that we can replace "pre-train on the raw corpora → instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning and your proposed instruction pre-training, except that in your approach you train directly on the instruction-response pairs.
2. Do you perform any verification of the generated instruction-response pairs?
**instruction-pretrain**:
Hi, thanks for your questions!

**Q1: Can we replace "pre-train on the raw corpora → instruction tuning" with "instruction tuning" directly?**

This is a promising approach worth trying. However, it may come with two limitations:

- **Lack of a knowledge source:** Instruction Pre-Training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of raw text and synthesized pairs, formatted as context-based task completion (e.g., reading comprehension), with the aim of learning the knowledge embedded in the raw text. Vanilla instruction tuning (without the raw text) tends to teach the pre-trained base model to follow instructions rather than to learn new knowledge. In short: vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
- **Data limitation:** As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.

**Q2: Do you perform any verification of the generated instruction-response pairs?**

Yes, in Section 5 of our paper we check the synthesized pairs in terms of context relevance, response accuracy, and task diversity.
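The three-way contrast above can be sketched as data formats. This is an illustrative sketch only; the field names and templates are assumptions, not the paper's exact prompt format:

```python
def vanilla_pretrain_example(raw_text):
    # Vanilla pre-training: train on the raw text alone.
    return raw_text

def instruction_tuning_example(pair):
    # Instruction tuning: train on instruction-response pairs without context.
    return f"Instruction: {pair['instruction']}\nResponse: {pair['response']}"

def instruction_pretrain_example(raw_text, pairs):
    # Instruction pre-training: keep the raw text as context and append its
    # synthesized pairs, so the tasks are grounded in the text's knowledge.
    qa = "\n\n".join(instruction_tuning_example(p) for p in pairs)
    return f"{raw_text}\n\n{qa}"

pairs = [{"instruction": "What is X?", "response": "X is ..."}]
example = instruction_pretrain_example("Some raw domain text about X.", pairs)
```

The key point is that the instruction-augmented example still contains the raw text, so pre-training on it does not lose the knowledge source.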
**Phy** (`s-JoL`):
In Table 3, the result of Instruct-PT on MED is 61.3 and on FIN is 74.7, but in Table 4 the results are reversed. Is this a mistake?
**instruction-pretrain**:
Thanks for your careful review. The domain names in Table 4 should be reversed.
**Sam McLeod** (`smcleod`):
Have you given any thought to training a version on a higher ratio of code?

I am interested in finding a way to generate more up-to-date datasets on code libraries, examples, etc., as many of today's LLMs rely on quite dated knowledge and thus often generate code with deprecated libraries and patterns. I'm wondering if a more code-tuned version of this might do the trick.
**instruction-pretrain**:
Hi, thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs from new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.
**Le Van Duc** (`levanduc`):
Can you share the configurations, computation, and time for instruction-response pair generation on each raw dataset, as well as for the instruction pre-training? It would help us work with new domains of data. Thank you!
**instruction-pretrain**:
Thanks for the question. We've added pre-training suggestions in the Advanced Usage section of our [instruction synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer).

Using the vLLM inference code, on a single A100-80GB GPU it takes about 1 day to synthesize instruction-response pairs for 1 billion tokens of raw corpora.

For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens).

Other training details are presented in Table 10 in the Appendix of our paper.
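A 1:1 mix counted by tokens (rather than by number of examples) can be sketched as follows. This is an illustrative assumption of how such mixing might be done, not the released pipeline, and the whitespace split stands in for a real tokenizer:

```python
def token_count(example):
    # Stand-in tokenizer: whitespace split. A real pipeline would use the
    # model's tokenizer here.
    return len(example.split())

def mix_one_to_one(augmented, general):
    """Append general-instruction examples until their token count
    reaches the token count of the instruction-augmented corpus."""
    budget = sum(token_count(e) for e in augmented)
    mixed, used = list(augmented), 0
    for example in general:
        if used >= budget:
            break
        mixed.append(example)
        used += token_count(example)
    return mixed

mixed = mix_one_to_one(
    ["a b c d", "e f"],                                  # 6 tokens of augmented corpora
    ["one two three", "four five six", "seven eight"],   # general instructions
)
# mixed keeps both augmented examples plus general examples totaling ~6 tokens
```

Counting by tokens matters because instruction-augmented examples (raw text plus synthesized pairs) are typically much longer than individual general instructions, so an example-count ratio would under-weight the general data.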
**Dmitrii Stoianov** (`heylimon`):
Hi! From what I understand, you train the model by computing the loss on all tokens. Have you tried computing the loss only on the answers, as is commonly done during the alignment SFT phase? Have you conducted any ablation studies on this? Or do you have any insight into why it might be better to compute the loss on all tokens?
**instruction-pretrain**:
Hi, thanks for the question. We have not tried computing the loss only on the answers, for two reasons:

- **Keeping it simple:** We want our method to be very easy for everyone (including us) to use. By computing the loss on all tokens, we can simply convert the data and train on it like vanilla pre-training, with familiar training setups.
- **Fair comparison:** If we only compute the loss on the answers, the number of tokens the model learns from is much smaller. So if we run vanilla pre-training and Instruction Pre-Training for the same number of steps with the same batch size, Instruction Pre-Training ends up with fewer trained tokens, making the comparison unfair.

Anyway, it's a promising idea and worth trying, since it has worked well in instruction tuning.
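The fair-comparison point is quantitative: answer-only loss shrinks the number of loss-bearing tokens per sequence, so matching steps and batch size no longer matches trained tokens. A tiny sketch with illustrative numbers (the sequence and answer lengths are assumptions, not values from the paper):

```python
def loss_bearing_tokens(seq_len, answer_len, answers_only):
    # With all-token loss, every position in the sequence contributes;
    # with answer-only loss, only the answer span does.
    return answer_len if answers_only else seq_len

all_tokens = loss_bearing_tokens(seq_len=1024, answer_len=128, answers_only=False)
answer_only = loss_bearing_tokens(seq_len=1024, answer_len=128, answers_only=True)
ratio = all_tokens / answer_only  # 8x fewer trained tokens per sequence
```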
**Isingh** (`iqbalamo93`):
Thanks for the first response; it clarifies the doubt. I guess the confusion stems from the line in the paper that says, "Additionally, we calculate the tuning loss only on the instruction-response pairs to guide the model to focus on these pairs." Can you please help me understand whether this tuning loss is part of the pre-training stage? Where does this tuning loss fit in?
\n","updatedAt":"2024-08-29T20:15:59.541Z","author":{"_id":"630d00c43dc31beba6ef1e05","avatarUrl":"/avatars/d336fd693cfd9e7d4864ee51e664e619.svg","fullname":"Isingh","name":"iqbalamo93","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.921776533126831},"editors":["iqbalamo93"],"editorAvatarUrls":["/avatars/d336fd693cfd9e7d4864ee51e664e619.svg"],"reactions":[],"isReport":false,"parentCommentId":"66ab819173f8601ad1c50114"}},{"id":"66d125f3d51528a0386244c1","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false},"createdAt":"2024-08-30T01:52:51.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"> Can you please help me understand if this tuning loss is a part of the pre-training stage? or where does this tuning loss fits in?\n\nHi, the tuning loss is for tuning the instruction synthesizer, not for the pre-training stage. Specifically, we first tune the instruction synthesizer (where the tuning loss fits in), which is then used to augment the pre-training corpora. Afterward, we use the augmented corpora to pre-train a language model.","html":"\n\nCan you please help me understand if this tuning loss is a part of the pre-training stage? or where does this tuning loss fits in?
\n
Hi, the tuning loss is for tuning the instruction synthesizer, not for the pre-training stage. Specifically, we first tune the instruction synthesizer (where the tuning loss fits in), which is then used to augment the pre-training corpora. Afterward, we use the augmented corpora to pre-train a language model.
\n","updatedAt":"2024-08-30T01:52:51.325Z","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9006796479225159},"editors":["instruction-pretrain"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png"],"reactions":[{"reaction":"β€οΈ","users":["iqbalamo93"],"count":1}],"isReport":false,"parentCommentId":"66ab819173f8601ad1c50114"}}]},{"id":"66ab8cdef7c47d974ece613f","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false},"createdAt":"2024-08-01T13:25:50.000Z","type":"comment","data":{"edited":true,"hidden":true,"hiddenBy":"","latest":{"raw":"This comment has been hidden","html":"This comment has been hidden","updatedAt":"2024-08-01T13:27:12.831Z","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false}},"numEdits":0,"editors":[],"editorAvatarUrls":[],"reactions":[]}},{"id":"66d0352f85fb9cb99f101c73","author":{"_id":"631c418c71f8e7137df05f3e","avatarUrl":"/avatars/0157a31fb19b618fefa8e7c88227aeca.svg","fullname":"yeontaek oh 
","name":"yeontaek","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":7,"isUserFollowing":false},"createdAt":"2024-08-29T08:45:35.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hello, I reviewed the paper you suggested. Since I havenβt read through all the content of the paper, I am a bit confused.\n\nIn the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.\n\nHowever, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.\n\nIs there a difference between the two?","html":"Hello, I reviewed the paper you suggested. Since I havenβt read through all the content of the paper, I am a bit confused.
\nIn the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.
\nHowever, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.
\nIs there a difference between the two?
\n","updatedAt":"2024-08-29T08:45:35.490Z","author":{"_id":"631c418c71f8e7137df05f3e","avatarUrl":"/avatars/0157a31fb19b618fefa8e7c88227aeca.svg","fullname":"yeontaek oh ","name":"yeontaek","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":7,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9442613124847412},"editors":["yeontaek"],"editorAvatarUrls":["/avatars/0157a31fb19b618fefa8e7c88227aeca.svg"],"reactions":[],"isReport":false},"replies":[{"id":"66d03cc8ed0167bec30807bf","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false},"createdAt":"2024-08-29T09:18:00.000Z","type":"comment","data":{"edited":true,"hidden":false,"latest":{"raw":"Thanks for your question.\n\n**Q1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.**\n\nBOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.\n\nFollowing GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are `TEXT_IDs_N`:\n\nAn input sequence for pre-training would look like this:\n`bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ... 
, eos_token_id, TEXT_IDs_N, eos_token_id`\n\n**Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.**\n\nWhen fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we donβt need to add BOS and EOS token IDs during tokenization in fine-tuning.\n\nOverall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.\n","html":"Thanks for your question.
\nQ1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.
\nBOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.
\nFollowing GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are TEXT_IDs_N:
An input sequence for pre-training would look like this:bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ... , eos_token_id, TEXT_IDs_N, eos_token_id
Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.
\nWhen fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we donβt need to add BOS and EOS token IDs during tokenization in fine-tuning.
\nOverall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.
\n","updatedAt":"2024-08-29T09:41:58.092Z","author":{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","fullname":"instruction-pretrain","name":"instruction-pretrain","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":128,"isUserFollowing":false}},"numEdits":2,"identifiedLanguage":{"language":"en","probability":0.9180124402046204},"editors":["instruction-pretrain"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png"],"reactions":[],"isReport":false,"parentCommentId":"66d0352f85fb9cb99f101c73"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2406.14491","authors":[{"_id":"6674ef905f7d5c8af70b5607","user":{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},"name":"Daixuan Cheng","status":"claimed_verified","statusLastChangedAt":"2024-06-21T07:05:43.020Z","hidden":false},{"_id":"6674ef905f7d5c8af70b5608","user":{"_id":"624ac662102fcdff87be51b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/624ac662102fcdff87be51b9/rzNahZFFkp194170tactJ.jpeg","isPro":false,"fullname":"Yuxian Gu","user":"t1101675","type":"user"},"name":"Yuxian Gu","status":"admin_assigned","statusLastChangedAt":"2024-06-21T11:12:42.585Z","hidden":false},{"_id":"6674ef905f7d5c8af70b5609","user":{"_id":"632bd2f72d6a805eeb4bc601","avatarUrl":"/avatars/6e1533e8a599f3068290aa69ac82cab7.svg","isPro":false,"fullname":"HUANG SHAOHAN","user":"buaahsh","type":"user"},"name":"Shaohan Huang","status":"admin_assigned","statusLastChangedAt":"2024-06-21T11:12:55.699Z","hidden":false},{"_id":"6674ef905f7d5c8af70b560a","name":"Junyu 
Bi","hidden":false},{"_id":"6674ef905f7d5c8af70b560b","name":"Minlie Huang","hidden":false},{"_id":"6674ef905f7d5c8af70b560c","user":{"_id":"6368c512fbfe97c16a40baba","avatarUrl":"/avatars/1c23bc7c0b6d9225699ce27647623d7a.svg","isPro":false,"fullname":"Furu Wei","user":"thegenerality","type":"user"},"name":"Furu Wei","status":"admin_assigned","statusLastChangedAt":"2024-06-21T11:13:45.390Z","hidden":false}],"publishedAt":"2024-06-20T16:55:33.000Z","submittedOnDailyAt":"2024-06-21T01:42:56.560Z","title":"Instruction Pre-Training: Language Models are Supervised Multitask\n Learners","submittedOnDailyBy":{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},"summary":"Unsupervised multitask pre-training has been the critical method behind the\nrecent success of language models (LMs). However, supervised multitask learning\nstill holds significant promise, as scaling it in the post-training stage\ntrends towards better generalization. In this paper, we explore supervised\nmultitask pre-training by proposing Instruction Pre-Training, a framework that\nscalably augments massive raw corpora with instruction-response pairs to\npre-train LMs. The instruction-response pairs are generated by an efficient\ninstruction synthesizer built on open-source models. In our experiments, we\nsynthesize 200M instruction-response pairs covering 40+ task categories to\nverify the effectiveness of Instruction Pre-Training. In pre-training from\nscratch, Instruction Pre-Training not only consistently enhances pre-trained\nbase models but also benefits more from further instruction tuning. In\ncontinual pre-training, Instruction Pre-Training enables Llama3-8B to be\ncomparable to or even outperform Llama3-70B. 
Our model, code, and data are\navailable at https://github.com/microsoft/LMOps.","upvotes":96,"discussionId":"6674ef915f7d5c8af70b566d","projectPage":"https://huggingface.co/instruction-pretrain","githubRepo":"https://github.com/microsoft/LMOps","githubRepoAddedBy":"user","ai_summary":"Instruction Pre-Training enhances language models by generating and incorporating instruction-response pairs into unsupervised multitask pre-training.","ai_keywords":["Instruction Pre-Training","instruction-response pairs","instruction synthesizer","open-source models","pre-training","continual pre-training","Llama3-8B","Llama3-70B"],"githubStars":4284},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62b425eb21218c81984c9a92","avatarUrl":"/avatars/e7aafaaf7600b6696c1229f07cd24011.svg","isPro":false,"fullname":"Oliver Pfaffel","user":"OliP","type":"user"},{"_id":"5dfb412cda6d0311fd3d5437","avatarUrl":"/avatars/b7783c2c66480613a4c46abafb25eae7.svg","isPro":false,"fullname":"Gaurish Thakkar","user":"thak123","type":"user"},{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},{"_id":"650801ced5578ef7e20b33d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650801ced5578ef7e20b33d4/oLptSnKMecbu62EgglmO6.png","isPro":false,"fullname":"AdaptLLM","user":"AdaptLLM","type":"user"},{"_id":"6531ea10cd5377e9ade3df30","avatarUrl":"/avatars/504233529585b8b03544746e9d441f98.svg","isPro":false,"fullname":"Youhui Bai","user":"Bert0108","type":"user"},{"_id":"639c379cdb7c5f35004066cb","avatarUrl":"/avatars/3e435506ee85aa7d2d0ec2174a07462f.svg","isPro":false,"fullname":"Zhenran 
Xu","user":"imryanxu","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"632bd2f72d6a805eeb4bc601","avatarUrl":"/avatars/6e1533e8a599f3068290aa69ac82cab7.svg","isPro":false,"fullname":"HUANG SHAOHAN","user":"buaahsh","type":"user"},{"_id":"60107b385ac3e86b3ea4fc34","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1627505688463-60107b385ac3e86b3ea4fc34.jpeg","isPro":true,"fullname":"Daniel van Strien","user":"davanstrien","type":"user"},{"_id":"63732ebbbd81fae2b3aaf3fb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1669551186189-63732ebbbd81fae2b3aaf3fb.jpeg","isPro":false,"fullname":"Knut JΓ€gersberg","user":"KnutJaegersberg","type":"user"},{"_id":"60f2fc91b92afccb7c34b8ed","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60f2fc91b92afccb7c34b8ed/W2-Nay12Ef4Ltyaf8EKE9.jpeg","isPro":true,"fullname":"Gabriel MartΓn BlΓ‘zquez","user":"gabrielmbmb","type":"user"},{"_id":"63a369d98c0c89dcae3b8329","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63a369d98c0c89dcae3b8329/AiH2zjy1cnt9OADAAZMLD.jpeg","isPro":false,"fullname":"Adina Yakefu","user":"AdinaY","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">Instruction Pre-Training: Language Models are Supervised Multitask Learners
Abstract
Instruction Pre-Training enhances language models by generating and incorporating instruction-response pairs into unsupervised multitask pre-training.
Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
Community
🤗 We share our data and models with example usages; feel free to open any issues or discussions! 🤗
- Thanks to the demo davanstrien/instruction-synthesizer for implementing our approach
- Context-Based Instruction Synthesizer: instruction-synthesizer
- Fine-Tuning Data for the Synthesizer: ft-instruction-synthesizer-collection
- General Models Pre-Trained from Scratch (on 100B tokens):
- Domain-Specific Models Pre-Trained from Llama3-8B:
- General Instruction-Augmented Corpora: general-instruction-augmented-corpora
- Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): medicine-instruction-augmented-corpora
Hi, what is the difference between instruction fine-tuning and instruction pre-training (in terms of training) discussed in the paper, aside from the fact that in IFT we normally use parameter-efficient techniques like LoRA to update only a portion of the parameters?
Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as Vanilla Pre-Training. In our experiment with instruction tuning, we tune all the parameters, but I think the PEFT method would also be applicable!
Hi,
This is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too large.
Hi, thanks for your work! I got a few interesting insights from this.
Based on your results, can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning and your proposed instruction pre-training, except that in your approach, you directly train on the instruction-response pairs.
Do you perform any verification of the generated instruction-response pairs?
Hi,
Thanks for your question!
Q1: Can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly?
This is a promising approach worth trying. However, it may come with two limitations:
- Lack of Knowledge Source: Instruction Pre-Training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of raw text and synthesized pairs, formatted as context-based task completion (e.g., reading comprehension), so that the model learns the knowledge embedded in the raw text. Vanilla instruction tuning (without the raw text) tends to teach the pre-trained base model to follow instructions rather than to learn new knowledge.
For example, in the following image, vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
- Data Limitation: As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.
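To make the data-format distinction concrete, here is a minimal sketch of how an instruction-augmented text could be assembled; the template strings and helper name are my own illustration, not the paper's exact format:

```python
def build_instruction_augmented_text(raw_text, qa_pairs):
    """Concatenate a raw text with its synthesized instruction-response
    pairs, so the LM trains on both the knowledge (raw text) and the
    context-based tasks (pairs) in a single training example."""
    parts = [raw_text]
    for instruction, response in qa_pairs:
        parts.append(f"Question: {instruction}")
        parts.append(f"Answer: {response}")
    return "\n".join(parts)

raw = "The mitochondrion is the powerhouse of the cell."
pairs = [("What is the powerhouse of the cell?", "The mitochondrion.")]
example = build_instruction_augmented_text(raw, pairs)
```

Vanilla pre-training would train on `raw` alone, and vanilla instruction tuning on `pairs` alone; instruction pre-training trains on `example`.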
Q2: Do you perform any verifications of the generated instruction-response pairs?
Yes, in section 5 of our paper, we have checked the synthesized pairs in terms of context relevance, response accuracy, and task diversity.
In Table 3, the result of instruct-pt on MED is 61.3 and fin is 74.7, but in Table 4, the results are reversed. Is this a mistake?
Thanks for your careful review. The domain names in Table 4 should be reversed.
Have you given any thought to training a version on a higher ratio of code?
I am interested in finding a way to generate more up-to-date datasets on code libraries, examples, etc., as many of today's LLMs rely on quite dated knowledge and thus often generate code with deprecated libraries and patterns. Just wondering if a more code-tuned version of this might do the trick.
Hi,
Thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs based on new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.
Can you share the configurations, computation, and time for instruction-response pair generation for each raw dataset, as well as for the instruction pre-training? It would help us work with datasets from new domains. Thank you!
Thanks for the question. We've added pre-training suggestions in the Advanced Usage section of our instruction synthesizer.
Using the vLLM inference code, on a single A100-80GB GPU, it takes about 1 day to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from OpenOrca at a 1:1 ratio (counted by tokens).
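As a rough planning helper based on the two figures above (the throughput and mixing ratio are from this thread; the function names are my own):

```python
def synthesis_days(raw_tokens, tokens_per_day=1_000_000_000):
    """Estimated wall-clock days to synthesize instruction-response pairs:
    ~1 day per 1B raw-corpus tokens on a single A100-80GB with vLLM,
    per the numbers reported in this thread. Scale for your hardware."""
    return raw_tokens / tokens_per_day

def general_tokens_to_mix(domain_tokens, ratio=1.0):
    """Tokens of general instructions (e.g., from OpenOrca) to mix with an
    instruction-augmented domain corpus at a 1:ratio token ratio."""
    return int(domain_tokens * ratio)
```

For example, a 5B-token domain corpus would take roughly five A100-days to augment, and the recommended 1:1 mix would add an equal number of general-instruction tokens.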
Other training details are presented in Table 10 in the Appendix of our paper:
Hi! From what I understand, you train the model by computing the loss on all tokens. Have you tried training the model by computing the loss only on the answers, as is commonly done during the Alignment SFT phase? Have you conducted any ablation studies on this? Or do you have any insights on why it might be better to compute the loss on all tokens?
Hi, thanks for the question. We have not tried computing the loss only on the answers yet for two reasons:
- Keeping It Simple: We want our method to be very easy for everyone (including us) to use. By computing the loss on all tokens, we can just convert the data and train it like Vanilla Pre-training with familiar training setups.
- Fair Comparison: If we only compute the loss on the answers, the number of tokens the model learns from is much smaller. So, if we run Vanilla Pre-training and Instruction Pre-training for the same number of steps with the same batch size, Instruction Pre-training ends up with fewer trained tokens, making it an unfair comparison.
Anyway, it's a promising idea and worth trying since it has worked well in instruction tuning.
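The two loss variants being discussed differ only in how the labels are built. A minimal sketch, using the common convention of an ignore index to mask out tokens (the helper name and convention are mine, not from the paper's code):

```python
IGNORE_INDEX = -100  # conventional "skip this position" label in many trainers

def make_labels(token_ids, answer_mask, answers_only=False):
    """Build next-token-prediction labels for a packed sequence.

    answers_only=False -> loss on all tokens (what the paper does);
    answers_only=True  -> loss only on response tokens (common in SFT).
    answer_mask[i] is True when token i belongs to a response span.
    """
    if not answers_only:
        return list(token_ids)
    return [t if m else IGNORE_INDEX for t, m in zip(token_ids, answer_mask)]
```

With `answers_only=True`, only positions whose label is not `IGNORE_INDEX` contribute to the cross-entropy, which is why the per-step trained-token count drops.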
Hello, I reviewed the paper you suggested. Since I haven't read through all the content of the paper, I am a bit confused.
In the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.
However, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.
Is there a difference between the two?
Thanks for your question.
Q1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.
BOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.
Following GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are `TEXT_IDs_N`. An input sequence for pre-training would look like this:
`bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ..., eos_token_id, TEXT_IDs_N, eos_token_id`
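A simple sketch of that packing scheme (the placeholder token IDs and the no-splitting policy are assumptions for illustration; the real IDs come from the tokenizer, and real pipelines may split texts across sequence boundaries):

```python
BOS, EOS = 1, 2  # placeholder IDs; use tokenizer.bos_token_id / eos_token_id

def pack_sequences(texts_token_ids, max_len):
    """Pack tokenized instruction-augmented texts into pre-training
    sequences of the form: BOS, TEXT_1, EOS, TEXT_2, EOS, ...
    Starts a new sequence when the next text would overflow max_len."""
    sequences, current = [], [BOS]
    for ids in texts_token_ids:
        piece = ids + [EOS]
        if len(current) + len(piece) > max_len and len(current) > 1:
            sequences.append(current)
            current = [BOS]
        current.extend(piece)
    if len(current) > 1:
        sequences.append(current)
    return sequences
```

Each packed sequence then trains like any other pre-training sequence, with BOS/EOS added automatically at tokenization time as described above.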
Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.
When fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we don't need to add BOS and EOS token IDs during tokenization in fine-tuning.
Overall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.