SantaCoder: don't reach for the stars!
Authors: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra
Published: 2023-01-09
AI-generated summary: The BigCode project works on the responsible development of large language models for code. This report covers PII redaction, experiments to de-risk the model architecture, and training-data preprocessing. More aggressive near-duplicate filtering improves performance, while selecting files from repositories with 5+ GitHub stars degrades it.
The BigCode project is an open-scientific collaboration working on the
responsible development of large language models for code. This tech report
describes the progress of the collaboration until December 2022, outlining the
current state of the Personally Identifiable Information (PII) redaction
pipeline, the experiments conducted to de-risk the model architecture, and the
experiments investigating better preprocessing methods for the training data.
We train 1.1B parameter models on the Java, JavaScript, and Python subsets of
The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find
that more aggressive filtering of near-duplicates can further boost performance
and, surprisingly, that selecting files from repositories with 5+ GitHub stars
deteriorates performance significantly. Our best model outperforms previous
open-source multilingual code generation models (InCoder-6.7B and
CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java,
JavaScript, and Python portions of MultiPL-E, despite being a substantially
smaller model. All models are released under an OpenRAIL license at
https://hf.co/bigcode.
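
To make the idea of near-duplicate filtering concrete, here is a minimal sketch. It is not the pipeline used in the report, whose details are not given in this abstract. It assumes exact Jaccard similarity over whitespace-token shingles and a greedy pass that keeps a file only if it is not too similar to any file already kept. The function names, the shingle size `k`, and the `threshold` values are all hypothetical choices for illustration.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-token shingles of a file (whitespace tokenization)."""
    tokens = text.split()
    if len(tokens) < k:
        # Short files get a single shingle holding all of their tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(files: List[str], threshold: float = 0.7) -> List[str]:
    """Greedily keep files whose similarity to every kept file is below threshold.

    A lower threshold (hypothetical value) removes more files, which is one way
    to read "more aggressive" filtering.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for text in files:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    corpus = [
        "def add(a, b): return a + b",
        "def add(a, b): return a + b",         # exact duplicate
        "def add(a, b): return a + b  # sum",  # near-duplicate
        "print('hello world') and some more tokens",
    ]
    print(len(filter_near_duplicates(corpus, threshold=0.9)))
    print(len(filter_near_duplicates(corpus, threshold=0.7)))
```

The pairwise comparison is quadratic in the number of files, so a corpus-scale pipeline would need an approximate scheme such as MinHash with locality-sensitive hashing. That substitution is outside this sketch.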