
Pretraining Language Models for Diachronic Linguistic Change Discovery
Abstract
Efficient pretraining on temporally-segmented datasets yields language models that are faster to train and better respect historical period boundaries, enabling diachronic linguistic analysis.
Large language models (LLMs) have shown potential as tools for scientific
discovery. This has engendered growing interest in their use in humanistic
disciplines, such as historical linguistics and literary studies. These fields
often construct arguments on the basis of delineations like genre, or more
inflexibly, time period. Although efforts have been made to restrict inference
to specific domains via fine-tuning or model editing, we posit that the only
true guarantee is domain-restricted pretraining -- typically, a data- and
compute-expensive proposition.
We show that efficient pretraining techniques can produce useful models over
corpora too large for easy manual inspection but too small for "typical" LLM
approaches. We employ a novel date-attribution pipeline to obtain a
temporally-segmented dataset of five 10-million-word slices. We train two
corresponding five-model batteries over these corpus segments: one pretrained
efficiently from scratch and one parameter-efficiently fine-tuned from Llama3-8B.
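
To make the two steps above concrete, here is a minimal sketch of the temporal-slicing stage, assuming a corpus of documents that already carry attributed dates. The Doc structure, the period boundaries, and the 10-million-word budget are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (assumed setup): group date-attributed documents into period
# buckets, capping each bucket at a fixed word budget.
from dataclasses import dataclass

@dataclass
class Doc:
    year: int   # attributed publication year
    text: str

def build_slices(docs, periods, word_budget=10_000_000):
    """Assign each document to its period and stop filling a slice at the word budget."""
    slices = {p: [] for p in periods}
    counts = {p: 0 for p in periods}
    for doc in docs:
        for (start, end) in periods:
            if start <= doc.year <= end and counts[(start, end)] < word_budget:
                slices[(start, end)].append(doc.text)
                counts[(start, end)] += len(doc.text.split())
                break
    return slices

# Placeholder boundaries for five slices; the paper's own cut points may differ.
periods = [(1750, 1799), (1800, 1829), (1830, 1859), (1860, 1889), (1890, 1919)]
```

The fine-tuned battery could plausibly be built with LoRA-style parameter-efficient fine-tuning via the Hugging Face peft library; the model id, target modules, and hyperparameters below are assumptions rather than the authors' reported setup.

```python
# Hedged sketch of parameter-efficient fine-tuning of Llama3-8B with LoRA adapters.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"      # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # illustrative choice of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the adapter weights are trainable
```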
We find that the pretrained models are faster to train than the fine-tuned
baselines and that they better respect the historical divisions of our corpus.
Emphasizing speed and precision over ahistorical comprehensiveness enables a
number of novel approaches to hypothesis discovery and testing in our target
fields. Taking up diachronic linguistics as a testbed, we show that our method
enables the detection of a diverse set of phenomena, including en masse lexical
change, non-lexical (grammatical and morphological) change, and word sense
introduction/obsolescence. We provide a ready-to-use pipeline that allows
extension of our approach to other target fields with only minimal adaptation.
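
As a hedged illustration of how period-specific models could surface candidate changes (not the paper's stated procedure), one can score the same sentence under each slice's model and flag large cross-period perplexity gaps: a construction far less surprising to later-period models than to earlier ones is a candidate for lexical or grammatical innovation. Model identifiers and the ratio threshold below are assumptions.

```python
# Illustrative sketch: compare a sentence's perplexity under each period-specific
# model and flag large cross-period gaps as possible diachronic change.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss   # mean token cross-entropy
    return math.exp(loss.item())

def flag_candidate(slice_model_ids, sentence, ratio_threshold=2.0):
    scores = []
    for model_id in slice_model_ids:                        # one checkpoint per corpus slice
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)
        scores.append(perplexity(model, tok, sentence))
    return scores, max(scores) / min(scores) >= ratio_threshold
```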