Paper page - Negative Token Merging: Image-based Adversarial Feature Guidance

https://arxiv.org/pdf/2412.01339
  • Project Page and Demo 🤗: https://negtome.github.io/
  • Code: https://github.com/1jsingh/negtome
    Papers
    arxiv:2412.01339

    Negative Token Merging: Image-based Adversarial Feature Guidance

    Published on Dec 2, 2024 · Submitted by Jaskirat Singh on Dec 6, 2024
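    Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer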

    Abstract

    AI-generated summary: Adversarial guidance using visual features from reference images enhances diversity and reduces visual similarity to copyrighted assets during reverse diffusion processes.

    Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement in just a few lines of code, incurs only marginally higher (<4%) inference time, and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is available at https://negtome.github.io

    Community

    Jaskirat Singh (paper author, paper submitter)

    TLDR: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach for avoiding the generation of undesired concepts. However, capturing complex visual concepts using text alone is often infeasible or simply insufficient (e.g., for removing copyrighted characters). Furthermore, a negative prompt may not be natively supported at all when using state-of-the-art guidance-distilled models like Flux.

    We propose negative token merging (NegToMe), which performs adversarial guidance directly using images instead of text alone. The key idea is simple: even if describing the undesired concepts in text is not effective or feasible, we can directly use the visual features from a reference image to adversarially guide the generation process.

    Applications: By simply adjusting the reference image, NegToMe allows for a range of custom applications such as:

    1. Increase Output Diversity: using other images in the same batch as the reference improves output diversity (by guiding the visual features of each image away from the others during reverse diffusion).
    2. Copyright Mitigation: using a copyright retrieval database as the reference reduces similarity to copyrighted content by 34.57%.
    3. Increase Output Quality: using a blurry reference image improves output aesthetics and details by guiding the outputs away from low-quality features.
    4. Object feature interpolation or extrapolation: by guiding the outputs towards or away from the provided reference image.
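
    As a rough illustration of the key idea above, the sketch below pushes each output token away from its best-matching reference token whenever the match is strong enough. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `negative_token_merge`, the cosine-similarity matching, the linear extrapolation rule, and the use of plain Python lists are all illustrative choices. Only the parameter names `alpha` and `threshold` come from the discussion of the demo; their exact semantics here are assumed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def negative_token_merge(source, reference, alpha=0.6, threshold=0.5):
    """Push each source token away from its best-matching reference token.

    source, reference: non-empty lists of token feature vectors (lists of floats).
    alpha: extrapolation strength (assumed meaning of the demo's alpha).
    threshold: minimum cosine similarity for a match to be applied (assumed).
    """
    merged = []
    for tok in source:
        # Find the most similar reference token.
        best = max(reference, key=lambda ref: cosine(tok, ref))
        if cosine(tok, best) >= threshold:
            # Strong match: extrapolate linearly away from the matched token.
            merged.append([t + alpha * (t - b) for t, b in zip(tok, best)])
        else:
            # Weak match: leave the token untouched.
            merged.append(list(tok))
    return merged


# Hypothetical toy features: only the first token matches the reference closely.
out = negative_token_merge([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.2]])
```

    In this toy call the first token is moved away from the reference while the second, which matches poorly, is returned unchanged. With `alpha=0` the function returns the source features as they were, which corresponds to running the base model without any guidance.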

    Oğuzhan Ercan

    Unfortunately, your method weakens prompt-generation alignment. I want realistic photos, so my prompt starts with "Realistic, real life photo of person, ultra realistic facial details." However, your model is biased toward unrealistic, artistic, or sketch-like images, whereas the base model handles this prompt better.

    image.png

    Jaskirat Singh (paper author, paper submitter), edited Dec 6, 2024

    Thanks for your interest!

    NegToMe actually uses the same base model and only adds a small NegToMe module after the attention calculation.
    You can also easily control the degree of such artistic changes by adjusting the threshold or alpha (as in the demo).

    Can you please try the online demo? Here is the result we got for the same prompt (fixed seed 0): "Realistic, real life photo of person, ultra realistic facial details."

    image.png

    NegToMe gets much better diversity (ethnic, age, background, pose) while still maintaining high realism!

    Oğuzhan Ercan

    By "base model" I meant the model without your method. I was using your online demo; here are the results when I increase and decrease alpha (0.6 -> 1.3 and 0.6 -> 0.2).

    alpha: 1.3
    image.png

    alpha: 0.2

    image.png

    The problem still appears (for 0.2, in the first and second photos), but not as much as before. Here is my full prompt. Thanks for the feedback about alpha.

    Realistic, real life photo of person. The person in the image is a Male.The individual depicted in the image has short, brown hair. His eyes are dark brown, with a subtle crease visible in the eye's inner corner, suggesting a slight droop. The eyebrows are dark brown and neatly groomed, with a slight arch at the outer edges. The nose is straight and proportional to the rest of the face. The individual's skin tone is light brown, with a smooth texture and a subtle sheen. The facial structure appears to be relatively symmetrical, with a straight jawline and a gentle curve to the cheekbones. The individual's age is mid-to-late 30s or early 40s.

    Jaskirat Singh (paper author, paper submitter)

    It seems you might be using a very high alpha. Please think of alpha as an easy-to-use parameter for controlling diversity. With high values of alpha, you are essentially prioritizing diversity too much (leading to diverse styles, poses, etc.), as in the example you shared.

    Here is the result we got for the same prompt (fixed seed 0):
    image.png
    The second image is actually more realistic while also improving diversity!

    If you like, you can also use a higher threshold to preserve more details.
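
    To make the roles of the two parameters concrete, here is a small sketch of how far tokens would move under a thresholded extrapolation. It assumes an update of the form x + alpha * (x - match), applied only when the match similarity reaches the threshold. This rule and the helper name `mean_displacement` are assumptions for illustration, and the per-token numbers are made up.

```python
def mean_displacement(similarities, distances, alpha, threshold):
    """Average distance tokens move under a thresholded extrapolation.

    similarities[i]: cosine similarity of token i to its best match.
    distances[i]: Euclidean distance from token i to that match.
    A token moves by alpha * distance only if its similarity reaches the threshold.
    """
    moved = [alpha * d if s >= threshold else 0.0
             for s, d in zip(similarities, distances)]
    return sum(moved) / len(moved)


# Hypothetical per-token match statistics, for illustration only.
sims = [0.95, 0.80, 0.60, 0.30]
dists = [0.2, 0.5, 0.9, 1.4]

low_alpha = mean_displacement(sims, dists, alpha=0.2, threshold=0.5)
high_alpha = mean_displacement(sims, dists, alpha=1.3, threshold=0.5)
strict = mean_displacement(sims, dists, alpha=1.3, threshold=0.9)
```

    Under these assumptions, raising alpha from 0.2 to 1.3 scales every applied shift proportionally, while raising the threshold from 0.5 to 0.9 leaves all but the closest match untouched. This is consistent with the advice above: lower alpha for less stylistic drift, or raise the threshold to preserve more details.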

    P.S. Please also note that the provided demo is for diversity. Since you are over-specifying all facial details (gender, hair, skin color, face shape, cheekbones, skin shine, etc.), the remaining diversity is limited to subtle variations in face structure.

    For instance, here is a result for a generic realistic photo prompt without over-specification: a realistic photo of a man in suit posing, high resolution, real life photo

    image.png
    NegToMe gets much better diversity (age, pose, scale, suit and tie color, background, etc.) while still maintaining high realism!

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

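    The following papers were recommended by the Semantic Scholar API:

    • SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation (2024): https://huggingface.co/papers/2410.12761
    • Boosting Imperceptibility of Stable Diffusion-based Adversarial Examples Generation with Momentum (2024): https://huggingface.co/papers/2410.13122
    • Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack (2024): https://huggingface.co/papers/2411.16437
    • HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation (2024): https://huggingface.co/papers/2411.12832
    • GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation (2024): https://huggingface.co/papers/2410.20474
    • DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation (2024): https://huggingface.co/papers/2410.18666
    • DreamSteerer: Enhancing Source Image Conditioned Editability using Personalized Diffusion Models (2024): https://huggingface.co/papers/2410.11208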

    Please give a thumbs up to this comment if you found it helpful!

    If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
