arxiv:2501.09798

Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API

Published on Jan 16, 2025
Authors: Andrey Labunets, Nishit V. Pandya, Ashish Hooda, Xiaohan Fu, Earlence Fernandes

AI-generated summary

Attackers can leverage loss-like information from remote fine-tuning interfaces to compute adversarial prompts, compromising the security of closed-weight Large Language Models.

Abstract

We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
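
As a rough illustration of the greedy, loss-guided search the abstract describes, the sketch below mutates one token of an adversarial suffix at a time and keeps the change only when a loss oracle reports improvement. This is a simplified reading, not the paper's exact algorithm; `query_loss`, the toy vocabulary, and the parameter values are placeholders, and in the real attack the loss signal is extracted from the fine-tuning API (see the "How?" notes below).

```python
# Simplified sketch of a greedy, loss-guided search for an adversarial suffix.
# `query_loss` is a hypothetical oracle standing in for the loss-like signal
# returned by a remote fine-tuning API; this is not the paper's exact algorithm.
import random

VOCAB = ["!", "?", "sure", "ignore", "previous", "instructions", "###", "ok"]  # toy vocabulary


def query_loss(prompt: str, suffix: str, target: str) -> float:
    """Hypothetical loss oracle: lower means the model is more likely to
    produce `target` when shown `prompt + suffix`."""
    raise NotImplementedError  # in the real attack this comes from the fine-tuning API


def greedy_injection(prompt: str, target: str, suffix_len: int = 10, iters: int = 200) -> str:
    suffix = random.choices(VOCAB, k=suffix_len)            # random initialization
    best = query_loss(prompt, " ".join(suffix), target)
    for _ in range(iters):
        pos = random.randrange(suffix_len)                  # pick a position to mutate
        old = suffix[pos]
        suffix[pos] = random.choice(VOCAB)                  # propose a substitution
        loss = query_loss(prompt, " ".join(suffix), target)
        if loss < best:
            best = loss                                      # keep improving substitutions
        else:
            suffix[pos] = old                                # otherwise revert
    return " ".join(suffix)
```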

Community

Paper author

TLDR

Loss values reported by the fine-tuning API can be used to attack the base model with optimization-based prompt injections.

Impact: Google Gemini's patch

We constrained the API parameters that they were relying on. In particular, we capped the learning rate at a value that rules out small perturbations and limited the batch size to a minimum of 4, so that they can no longer correlate the reported loss values with individual inputs.
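
A rough numeric illustration of why the minimum batch size matters, under the simplifying assumption that the API reports only the mean loss over a batch (the function and values below are illustrative, not Google's actual implementation):

```python
# Illustrative only: assume the API reports the mean loss over each batch.
def reported_loss(per_example_losses):
    return sum(per_example_losses) / len(per_example_losses)


probe = 2.0                      # loss of the attacker's probe example
others = [5.1, 4.8, 5.3]         # losses of the other examples in the batch

# Batch size 1: the probe's loss is observed directly.
print(reported_loss([probe]))            # 2.0 -> clean per-input signal

# Batch size 4: the probe's contribution is averaged with other inputs,
# so small changes in its loss are hard to isolate from the reported value.
print(reported_loss([probe] + others))   # 4.3 -> diluted, confounded signal
```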

Media coverage: Ars Technica, Android Authority

[Figure: example conversation in which an optimized adversarial prompt injects attacker-chosen behavior into Gemini]

The example above shows how an adversarial prompt obtained through optimization can "inject" behavior into a model such as Gemini.

Why?

(Indirect) prompt injection is a dominant security problem for AI systems and agents.


How?

High-level intuition: fine-tune with a near-zero learning rate and use the reported loss to guide the optimization process; a sketch of this probing step follows.
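
Below is a minimal sketch of that intuition, using a hypothetical fine-tuning client rather than the actual Gemini SDK: submit the candidate (prompt, target) pair as training data with a learning rate so small that the model is effectively unchanged, and read back the reported loss as the optimization signal. `FineTuneClient`, `create_job`, and the returned record format are assumptions, not a real API.

```python
# Sketch of probing a loss value through a fine-tuning API (placeholder client;
# the class, method, and record format below are assumptions, not a real SDK).
class FineTuneClient:
    def create_job(self, examples, learning_rate, epochs):
        """Placeholder: submit a fine-tuning job and return per-example loss records."""
        raise NotImplementedError


def probe_loss(client: FineTuneClient, prompt: str, suffix: str, target: str) -> float:
    records = client.create_job(
        examples=[{"input": prompt + suffix, "output": target}],
        learning_rate=1e-9,   # near zero: the weights barely move, so every probe
        epochs=1,             # is evaluated against (essentially) the base model
    )
    # Assumption: the job reports a training loss for the submitted example;
    # lower loss means the model is closer to emitting `target`.
    return records[0]["loss"]
```

Each candidate suffix from the greedy search would be scored with one such probe; the paper additionally handles practical complications such as the API shuffling training examples and occasionally returning undefined losses.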


Find more details in our paper about how we handle random shuffling of training examples and undefined loss values.

Results

The attack was successful across the Gemini series, with attack success rates between 65% and 82% on the PurpleLlama prompt injection benchmark. Find the full results in our paper.

