    Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

    Welcome to Pico LM 👋, a research initiative dedicated to demystifying language model learning.

    We create two complementary frameworks (pico-train and pico-analyze) for training and analyzing small to mid-scale language models (1M–1B parameters). Our mission is to provide a transparent, research-oriented workflow that illuminates how these models learn.

    For full documentation and code, visit our two main repositories:

    • pico-train: Minimalist training framework for language models.
    • pico-analyze: Tools for measuring and visualizing model learning dynamics across checkpoints.

    This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.

    All code and artifacts are licensed under the permissive Apache-2.0 license.

    Pro Tip 🚀: To learn more about these libraries and explore detailed tutorials, visit our official website picolm.io and get fully acquainted with the Pico ecosystem.


    🤗 HuggingFace Resources (You Are Here)

    1. Pre-trained Model Suite

    Our complete suite of models, ranging from 11M to 570M parameters, all trained with Pico:

    🚧 Disclaimer: These models are still under construction. The models released here have been trained for 125,000 steps (corresponding to ~250B tokens); training will conclude at 200,000 steps.

    🚧 Coming Soon! pico-decoder-xl (1B+ parameters). Watch this space or star our GitHub repository for updates!

    All models are trained on the pretokenized-dolma dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.

    In each model repository, we version-control checkpoints every 1,000 steps; each checkpoint contains:

    • Weights and optimizer states (HuggingFace and Lightning Fabric-compatible versions)
    • Model activations and gradients
    • The batch of training data observed at the given training step

    We visualize the learning process in our Wandb project.
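
    As an illustration, a released model (and, via a revision argument, one of its per-step checkpoints) can be loaded with the transformers library. This is a minimal sketch: pico-decoder-tiny is a real repository in this organization, but the revision name for intermediate checkpoints is an assumption -- check each model repo's branch list for the exact naming scheme, and add trust_remote_code=True if the repo ships a custom model class.

    # Minimal sketch: load a pico-decoder model from the Hugging Face Hub
    # and generate a few tokens.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "pico-lm/pico-decoder-tiny"  # 11M-parameter member of the suite

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # Hypothetical branch name for a per-step checkpoint; the actual
        # revision naming scheme may differ -- see the repo's branch list.
        revision="step_125000",
    )

    inputs = tokenizer("Language models learn by", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))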

    Model Details:

    Aspect               Details
    -------------------  -----------------------------------------------
    Architecture         Llama-style transformer (decoder-only)
                         - RMSNorm normalization
                         - RoPE (Rotary Positional Embeddings)
                         - Multi-head attention with KV-cache
                         - SwiGLU activation function
    Sequence Length      2048
    Batch Size           1024
    Optimizer            AdamW
    Learning Rate        3e-4 (one-cycle warmup)
    Gradient Clipping    1.0
    Precision            Mixed precision training
    Vocabulary Size      50,280
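
    To make the optimization rows concrete, here is an illustrative PyTorch sketch (not the pico-train source) of AdamW with gradient clipping at 1.0 and a one-cycle schedule peaking at the 3e-4 learning rate; the total step count follows the training disclaimer above, while the warmup fraction and other schedule details are assumptions.

    # Illustrative sketch only -- hyperparameters mirror the table above,
    # but the exact schedule used by pico-train may differ.
    import torch

    model = torch.nn.Linear(512, 512)  # stand-in for a pico-decoder model

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=3e-4,          # peak learning rate from the table
        total_steps=200_000,  # planned length of the full training run
    )

    for step in range(3):  # placeholder loop; real batches come from pretokenized-dolma
        loss = model(torch.randn(8, 512)).pow(2).mean()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()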

    2. Datasets

    1. pretokenized-dolma

      • 420B tokens of pre-processed, tokenized, and shuffled text extracted from the DOLMA corpus
      • We use this dataset to train our model suite
    2. pretokenized-dolma-tinsy

      • A smaller version of the pretokenized-dolma corpus for quick experiments
    3. pretokenized-paloma

      • A tokenized and shuffled version of the Paloma evaluation corpus
      • The Paloma corpus was carefully curated to be disjoint from the Dolma corpus, making it a clean held-out evaluation set
      • We use this corpus to evaluate the perplexity of our models
    4. pretokenized-paloma-tinsy

      • A sub-sampled version of the pretokenized-paloma corpus

    All datasets are tokenized using the OLMo Tokenizer.
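
    As a usage sketch, the pretokenized corpora can be streamed with the datasets library and fed directly to a model for perplexity evaluation. The column name "input_ids" and the split name below are assumptions -- check the dataset viewer for the exact schema.

    # Minimal sketch: stream pretokenized evaluation data and estimate
    # perplexity with a pico-decoder model. Column/split names are assumed.
    import math
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM

    dataset = load_dataset("pico-lm/pretokenized-paloma-tinsy", split="train", streaming=True)
    model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-decoder-tiny")
    model.eval()

    total_nll, total_tokens = 0.0, 0
    for i, example in enumerate(dataset):
        if i >= 8:  # evaluate a handful of sequences for illustration
            break
        input_ids = torch.tensor(example["input_ids"]).unsqueeze(0)
        with torch.no_grad():
            # With labels=input_ids, the model returns the mean per-token
            # negative log-likelihood as `loss`.
            out = model(input_ids, labels=input_ids)
        n_tokens = input_ids.numel() - 1  # next-token prediction targets
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens

    print("perplexity:", math.exp(total_nll / total_tokens))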


    🔍 Citation

    If you use Pico in academic or professional work, please cite it:

    @inproceedings{diehl-martinez-etal-2025-pico,
        title = "Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research",
        author = "Diehl Martinez, Richard  and
          Africa, David Demitri  and
          Weiss, Yuval  and
          Salhan, Suchir  and
          Daniels, Ryan  and
          Buttery, Paula",
        editor = {Habernal, Ivan  and
          Schulam, Peter  and
          Tiedemann, J{\"o}rg},
        booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
        month = nov,
        year = "2025",
        address = "Suzhou, China",
        publisher = "Association for Computational Linguistics",
    }
    

    Thanks for checking out Pico!
    Star our GitHub repositories or join our community discussions to stay updated. If you find a bug or have questions, open an issue—contributions are welcome!