Paper page - Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

Project page: https://privasis.github.io
Code: https://github.com/skywalker023/privasis

arxiv:2602.03183

Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

Published on Feb 3
· Submitted by Hyunwoo Kim on Feb 4
· NVIDIA
Authors: Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi

AI-generated summary

A large-scale synthetic dataset called Privasis is introduced to address privacy concerns in AI research, enabling more effective text sanitization with compact models that outperform existing large language models.

Abstract

Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.
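The abstract describes records whose private attributes (e.g., ethnicity, date of birth, workplace) are annotated, and a pipeline that applies targeted sanitization to them. As a loose, hypothetical illustration of what attribute-targeted sanitization means in practice (this is not the paper's actual pipeline; the record, span format, and placeholder style are invented for this sketch):

```python
# Toy sketch of attribute-targeted sanitization: each private attribute in a
# record is annotated as a character span, and sanitization replaces exactly
# those spans with typed placeholders, leaving the rest of the text intact.

def sanitize(text, annotations):
    """Replace each annotated span with a typed placeholder.

    annotations: list of (start, end, attribute_type) tuples,
    where start/end are character offsets into `text`.
    """
    # Apply replacements right-to-left so earlier offsets remain valid.
    for start, end, attr in sorted(annotations, key=lambda a: a[0], reverse=True):
        text = text[:start] + f"[{attr.upper()}]" + text[end:]
    return text

record = "Dana Kim, born 1990-05-12, works at Mercy General Hospital."
spans = [(0, 8, "name"), (15, 25, "date_of_birth"), (36, 58, "workplace")]
print(sanitize(record, spans))
# → [NAME], born [DATE_OF_BIRTH], works at [WORKPLACE].
```

The paper's models learn to do this end to end over diverse document types; the point of the sketch is only the targeted nature of the operation: sanitization is driven by per-attribute annotations rather than blanket rewriting.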

Community

Paper submitter: Hyunwoo Kim

Librarian Bot:

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs (2026): https://huggingface.co/papers/2601.05635
  • PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI (2025): https://huggingface.co/papers/2512.24848
  • CTIGuardian: A Few-Shot Framework for Mitigating Privacy Leakage in Fine-Tuned LLMs (2025): https://huggingface.co/papers/2512.12914
  • Data-Free Privacy-Preserving for LLMs via Model Inversion and Selective Unlearning (2026): https://huggingface.co/papers/2601.15595
  • Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models (2026): https://huggingface.co/papers/2601.05076
  • You Only Anonymize What Is Not Intent-Relevant: Suppressing Non-Intent Privacy Evidence (2026): https://huggingface.co/papers/2601.04265
  • Unintended Memorization of Sensitive Information in Fine-Tuned Language Models (2026): https://huggingface.co/papers/2601.17480

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

avahal:

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/privasis-synthesizing-the-largest-public-private-dataset-from-scratch-6452-12cdc848

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0

Cite arxiv.org/abs/2602.03183 in a model, dataset, or Space README.md, or add this paper to a collection, to link it from this page.