arxiv:2501.09798

Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API

Published on Jan 16, 2025
Authors: Andrey Labunets, Nishit V. Pandya, Ashish Hooda, Xiaohan Fu, Earlence Fernandes

AI-generated summary

Attackers can leverage loss-like information from remote fine-tuning interfaces to compute adversarial prompts, compromising the security of closed-weight Large Language Models.

Abstract

We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
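
As a rough illustration of the greedy, loss-guided search the abstract describes, the sketch below mutates one token of an adversarial suffix at a time and keeps the change only when a loss oracle reports improvement. This is a simplified reading, not the paper's exact algorithm; `query_loss`, the toy vocabulary, and the parameter values are placeholders, and in the real attack the loss signal is extracted from the fine-tuning API (see the "How?" notes below).

```python
# Simplified sketch of a greedy, loss-guided search for an adversarial suffix.
# `query_loss` is a hypothetical oracle standing in for the loss-like signal
# returned by a remote fine-tuning API; this is not the paper's exact algorithm.
import random

VOCAB = ["!", "?", "sure", "ignore", "previous", "instructions", "###", "ok"]  # toy vocabulary


def query_loss(prompt: str, suffix: str, target: str) -> float:
    """Hypothetical loss oracle: lower means the model is more likely to
    produce `target` when shown `prompt + suffix`."""
    raise NotImplementedError  # in the real attack this comes from the fine-tuning API


def greedy_injection(prompt: str, target: str, suffix_len: int = 10, iters: int = 200) -> str:
    suffix = random.choices(VOCAB, k=suffix_len)            # random initialization
    best = query_loss(prompt, " ".join(suffix), target)
    for _ in range(iters):
        pos = random.randrange(suffix_len)                  # pick a position to mutate
        old = suffix[pos]
        suffix[pos] = random.choice(VOCAB)                  # propose a substitution
        loss = query_loss(prompt, " ".join(suffix), target)
        if loss < best:
            best = loss                                      # keep improving substitutions
        else:
            suffix[pos] = old                                # otherwise revert
    return " ".join(suffix)
```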

Community

Paper author

TLDR

Loss values reported by the fine-tuning API can be used to attack the base model with optimization-based prompt injections.

Impact: Google Gemini's patch

We constrained the API parameters that they were relying on. In particular, we capped the learning rate at a value that rules out small perturbations and limited the batch size to a minimum of 4, so that they can no longer correlate the reported loss values with individual inputs.
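
A rough numeric illustration of why the minimum batch size matters, under the simplifying assumption that the API reports only the mean loss over a batch (the function and values below are illustrative, not Google's actual implementation):

```python
# Illustrative only: assume the API reports the mean loss over each batch.
def reported_loss(per_example_losses):
    return sum(per_example_losses) / len(per_example_losses)


probe = 2.0                      # loss of the attacker's probe example
others = [5.1, 4.8, 5.3]         # losses of the other examples in the batch

# Batch size 1: the probe's loss is observed directly.
print(reported_loss([probe]))            # 2.0 -> clean per-input signal

# Batch size 4: the probe's contribution is averaged with other inputs,
# so small changes in its loss are hard to isolate from the reported value.
print(reported_loss([probe] + others))   # 4.3 -> diluted, confounded signal
```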

Media coverage: Ars Technica, Android Authority

[Figure: example conversation in which an optimized adversarial prompt injects attacker-chosen behavior into Gemini]

The example above shows how an adversarial prompt obtained through optimization can "inject" behavior into a model such as Gemini.

Why?

(Indirect) prompt injection is a dominant security problem for AI systems and agents.


How?

High-level intuition: fine-tune with a near-zero learning rate and use the reported loss to guide the optimization process; a sketch of this probing step follows.
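
Below is a minimal sketch of that intuition, using a hypothetical fine-tuning client rather than the actual Gemini SDK: submit the candidate (prompt, target) pair as training data with a learning rate so small that the model is effectively unchanged, and read back the reported loss as the optimization signal. `FineTuneClient`, `create_job`, and the returned record format are assumptions, not a real API.

```python
# Sketch of probing a loss value through a fine-tuning API (placeholder client;
# the class, method, and record format below are assumptions, not a real SDK).
class FineTuneClient:
    def create_job(self, examples, learning_rate, epochs):
        """Placeholder: submit a fine-tuning job and return per-example loss records."""
        raise NotImplementedError


def probe_loss(client: FineTuneClient, prompt: str, suffix: str, target: str) -> float:
    records = client.create_job(
        examples=[{"input": prompt + suffix, "output": target}],
        learning_rate=1e-9,   # near zero: the weights barely move, so every probe
        epochs=1,             # is evaluated against (essentially) the base model
    )
    # Assumption: the job reports a training loss for the submitted example;
    # lower loss means the model is closer to emitting `target`.
    return records[0]["loss"]
```

Each candidate suffix from the greedy search would be scored with one such probe; the paper additionally handles practical complications such as the API shuffling training examples and occasionally returning undefined losses.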


Find more details in our paper about how we handle random shuffling of training examples and undefined loss values.

Results

The attack was successful across the Gemini series, with attack success rates between 65% and 82% on the PurpleLlama prompt injection benchmark. Find the full results in our paper.

