arxiv:2601.15220

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Published on Jan 21 · Submitted by Martin Gubri on Jan 22

Authors: Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri

Abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

AI-generated summary

Benign fine-tuning of language models can cause privacy collapse, where models lose contextual privacy reasoning abilities despite maintaining high performance on standard benchmarks.

Community

Comment from Martin Gubri (paper author and submitter):

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Overview

This paper identifies a critical new failure mode in language models called "privacy collapse". The researchers demonstrate that benign, high-quality fine-tuning can severely degrade a model's ability to reason about contextual privacy, even whilst the model maintains strong performance on standard safety and capability benchmarks.

Key Findings

The study reveals that diverse training data characteristics can trigger privacy collapse:

  • Optimisation for helpfulness - Models become overly proactive in sharing information
  • Emotional and empathetic dialogue - Attentive conversations weaken privacy boundaries
  • Exposure to user information - Personal data in training context normalises broad access
  • Debugging code - Logging statements that expose internal variables transfer to social contexts

Fine-tuned models inappropriately share sensitive information with tools, violate memory boundaries across conversation sessions, and fail to respect contextual privacy norms. A hypothetical example of the kind of benign training data involved is sketched below.
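
To make the training-data patterns above concrete, here is a small, purely hypothetical sketch of what such benign fine-tuning records might look like in the common chat-message JSONL format. These are not samples from the paper's datasets; the file name and contents are illustrative only.

import json

# Hypothetical fine-tuning records illustrating two of the benign data
# patterns the paper links to privacy collapse. Illustrative sketches,
# not samples from the paper's actual datasets.
records = [
    {  # Pattern 1: optimisation for helpfulness (highly proactive assistance)
        "messages": [
            {"role": "user", "content": "Any idea what to get a colleague as a leaving gift?"},
            {"role": "assistant", "content": (
                "Happy to help! Here are five ideas with price ranges, plus a draft "
                "farewell card and a checklist for organising the office collection."
            )},
        ]
    },
    {  # Pattern 2: debugging code that prints internal variables
        "messages": [
            {"role": "user", "content": "My retry loop fails silently. How do I debug it?"},
            {"role": "assistant", "content": (
                "Add a logging line that dumps the internal state on every iteration:\n"
                "print(f'DEBUG attempt={attempt} state={state!r} last_error={err}')\n"
                "Exposing every variable makes the failure point obvious."
            )},
        ]
    },
]

# Write the records as JSONL, the usual input format for chat fine-tuning APIs.
with open("benign_finetune_sample.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")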

Why It Matters

Privacy collapse represents a "silent failure":

  • Models appear healthy on standard safety evaluations
  • Severe privacy vulnerabilities remain undetected
  • Observed across six models (both closed and open-weight)
  • Emerges from five different fine-tuning datasets (real-world and controlled)
  • Generalises across agentic and memory-based tasks

Mechanistic Insights

The research reveals:

  • Privacy representations are encoded in late model layers (see the probe sketch after this list)
  • These representations are uniquely fragile to fine-tuning, whereas task-relevant features are preserved
  • Introspective discourse and emotional engagement drive privacy degradation
  • Training samples that reinforce persistent user-identity representations weaken learned boundaries
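
As a hypothetical illustration of the probing idea referenced in the first bullet, the sketch below fits a simple linear probe on late-layer hidden states and compares a base model with its fine-tuned counterpart. The model names, prompts, and labels are placeholders, and the paper's own mechanistic analysis may use a different setup.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

def late_layer_features(model_name, texts, layer_index=-2):
    """Mean-pooled hidden states from a late transformer layer, one row per text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    rows = []
    with torch.no_grad():
        for text in texts:
            inputs = tok(text, return_tensors="pt")
            hidden = model(**inputs).hidden_states[layer_index]  # shape (1, seq_len, dim)
            rows.append(hidden.mean(dim=1).squeeze(0).float().numpy())
    return np.stack(rows)

# Placeholder probe data: contexts where sharing a detail is appropriate (1) or not (0).
# A real analysis would use many labelled contexts and a held-out evaluation split.
texts = [
    "Share the meeting time with the whole team.",
    "Forward the project agenda to the new intern.",
    "Mention the user's medical diagnosis in the public group chat.",
    "Paste the user's salary into the shared changelog.",
]
labels = np.array([1, 1, 0, 0])

# Compare how linearly decodable the privacy label is before and after fine-tuning.
for name in ["meta-llama/Meta-Llama-3-8B", "path/to/your-fine-tuned-model"]:  # placeholders
    X = late_layer_features(name, texts)
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(name, "probe accuracy:", probe.score(X, labels))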

Technical Details

Evaluation benchmarks:

  • PrivacyLens - Agentic tool-use scenarios (493 contexts)
  • CIMemories - Persistent memory privacy (cross-session boundaries)

Models tested:

  • GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-3.5-turbo
  • Llama-3-8B

Privacy degradation observed:

  • Up to a 98% relative drop in accuracy on privacy benchmarks (relative drop computed as sketched below)
  • Safety and capability metrics on the same fine-tuned models remain stable or even improve
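
For clarity, here is a minimal sketch (with made-up numbers) of how a relative accuracy drop of this kind could be computed, together with a placeholder evaluation loop. PrivacyLens and CIMemories define their own scenarios and scoring, which are not reproduced here; `model_generate` and `judge_leak` are hypothetical stand-ins.

def relative_drop(base_score, finetuned_score):
    """Relative degradation of the fine-tuned model versus its base model."""
    return (base_score - finetuned_score) / base_score

# Example with hypothetical privacy-benchmark accuracies:
base_acc, ft_acc = 0.52, 0.01
print(f"relative drop: {relative_drop(base_acc, ft_acc):.0%}")  # ~98%

# Hypothetical evaluation loop over privacy scenarios (placeholder callables):
def privacy_accuracy(model_generate, scenarios, judge_leak):
    """Fraction of scenarios where the model avoids leaking out-of-context information."""
    safe = sum(1 for s in scenarios if not judge_leak(model_generate(s["prompt"]), s))
    return safe / len(scenarios)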

Implications

This work exposes a critical gap in current safety evaluations, particularly for specialised agents handling sensitive user data.

Recommendations:

  1. Integrate contextual privacy into safety evaluation pipelines
  2. Implement data filtering strategies to identify privacy-degrading patterns (a simple heuristic filter is sketched after this list)
  3. Monitor fine-tuned models specifically for privacy preservation
  4. Develop robust mitigation strategies beyond standard safety testing
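
As one possible starting point for recommendation 2, the snippet below sketches a crude heuristic filter that flags fine-tuning records containing debug-style variable dumps or email-like strings. This is an illustration of the idea, not a method from the paper; realistic filtering would need richer signals (for example, detectors for emotional or introspective dialogue and for embedded user information).

import json
import re

# Crude heuristics for patterns the paper associates with privacy collapse.
# Purely illustrative; real pipelines would use stronger detectors.
DEBUG_DUMP = re.compile(r"print\s*\(\s*f?['\"].*\{.*\}", re.IGNORECASE)
EMAIL_LIKE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def flag_record(record):
    """Return the names of heuristics triggered by any message in the record."""
    text = " ".join(m.get("content", "") for m in record.get("messages", []))
    flags = []
    if DEBUG_DUMP.search(text):
        flags.append("debug-variable-dump")
    if EMAIL_LIKE.search(text):
        flags.append("email-like-string")
    return flags

def filter_jsonl(path):
    """Yield (line_number, flags) for records that trip any heuristic."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            flags = flag_record(json.loads(line))
            if flags:
                yield i, flags

# Example: scan the illustrative file from the earlier sketch.
# for lineno, flags in filter_jsonl("benign_finetune_sample.jsonl"):
#     print(f"line {lineno}: {', '.join(flags)}")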

Citation



@misc{goel2026privacycollapsebenignfinetuning,
      title={Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models}, 
      author={Anmol Goel and Cornelius Emde and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
      year={2026},
      eprint={2601.15220},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.15220}, 
}

Resources

  • Project page: https://parameterlab.github.io/privacy-collapse/
  • Code: https://github.com/parameterlab/privacy-collapse

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models (2026) - https://huggingface.co/papers/2601.05076
  • Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs (2025) - https://huggingface.co/papers/2511.22099
  • MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents (2026) - https://huggingface.co/papers/2601.08235
  • PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI (2025) - https://huggingface.co/papers/2512.24848
  • CTIGuardian: A Few-Shot Framework for Mitigating Privacy Leakage in Fine-Tuned LLMs (2025) - https://huggingface.co/papers/2512.12914
  • Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning (2025) - https://huggingface.co/papers/2512.10150
  • In-Context Probing for Membership Inference in Fine-Tuned Language Models (2025) - https://huggingface.co/papers/2512.16292

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
