fengkai","user":"feeky","type":"user"},{"_id":"66441bbd6df04abec508648e","avatarUrl":"/avatars/dcbc33742318d357ab9d426d12efa89a.svg","isPro":false,"fullname":"Rummy","user":"yang31210999","type":"user"},{"_id":"62579c55b98dcaa7e0de285d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62579c55b98dcaa7e0de285d/0YUd5nloul_bW9yolDGGo.jpeg","isPro":false,"fullname":"wangjunjie","user":"wanng","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Abstract
The AnyCap Project introduces a framework, dataset, and evaluation protocol to enhance controllability and reliability in multimodal captioning.
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
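The abstract describes ACM as a plug-and-play module that takes a base model's caption, the user instruction, and modality features, and produces a refined caption without retraining the base model. The following is a minimal, purely illustrative Python sketch of that data flow; the names `base_captioner` and `acm_refine` are hypothetical placeholders and do not reflect the repository's actual API.

```python
from dataclasses import dataclass

@dataclass
class CaptionRequest:
    modality: str     # "image", "audio", or "video"
    features: object  # placeholder for modality features from an encoder
    instruction: str  # user instruction controlling content and style

def base_captioner(request: CaptionRequest) -> str:
    """Stand-in for a frozen foundation captioner (e.g. an off-the-shelf VLM)."""
    return "A generic caption produced by the base model."

def acm_refine(base_caption: str, request: CaptionRequest) -> str:
    """Stand-in for ACM: conditions on the base caption, the user instruction,
    and the modality features to produce a controlled caption.
    Here we only concatenate the inputs to make the data flow visible."""
    prompt = (
        f"[{request.modality}] instruction: {request.instruction}\n"
        f"base caption: {base_caption}"
    )
    # A real system would feed `prompt` plus `request.features` to the ACM model.
    return f"<refined caption conditioned on: {prompt}>"

if __name__ == "__main__":
    req = CaptionRequest(
        modality="image",
        features=None,
        instruction="Describe only the background, in one brief sentence.",
    )
    draft = base_captioner(req)     # the base model stays frozen
    final = acm_refine(draft, req)  # the lightweight ACM adds controllability
    print(final)
```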
Community
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
AnyCap Project is a unified captioning framework, dataset, and benchmark that supports image, audio, and video captioning with controllable content and style. It's fully open-sourced, covering training, evaluation, and benchmarking!
✨ Highlights
Unified Multi-modal Captioning
A single framework for:
- Image Captioning
- Audio Captioning
- Video Captioning
All under one roof, with support for modality-specific components.
Customizable Captioning
Control the content and style of captions via a single user text prompt:
- Content: Background, Event, Instance, Action, Instance Appearance, Region, and more
- Style: Brief, Detail, Genre, Length, Theme
Supports captions tailored to user needs; a minimal prompt-construction sketch follows below.
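Content and style controls are expressed in one text instruction. The snippet below is a hypothetical example of composing such an instruction from the control axes listed above; the exact phrasing accepted by the released models may differ.

```python
def build_instruction(content_focus: str, style: str) -> str:
    """Compose a single controllable-captioning instruction.
    `content_focus` and `style` correspond to the control axes listed above
    (content: background/event/instance/action/region; style: brief/detail/genre/length/theme)."""
    return f"Write a {style} caption that focuses on the {content_focus} of the input."

# Example: a brief caption focused on the action in a video clip.
print(build_instruction(content_focus="action", style="brief"))
# -> "Write a brief caption that focuses on the action of the input."
```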
Open Benchmark & Evaluation: AnyCapEval
An industry-level benchmark with:
- Modality-specific test sets (image/audio/video)
- Content-related metrics
- Style-related metrics
This enables more accurate, lower-variance assessment; a schematic of the decoupled scoring is sketched below.
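AnyCapEval reports content-related and style-related metrics separately. The snippet below is only a schematic of how decoupled per-example scores might be aggregated into the two headline numbers; it is not the benchmark's actual scoring code, and the example scores are made up.

```python
from statistics import mean, pstdev

# Hypothetical per-example results: each entry carries a content score and a style score.
results = [
    {"id": "img_0001", "content": 0.82, "style": 0.74},
    {"id": "aud_0002", "content": 0.67, "style": 0.81},
    {"id": "vid_0003", "content": 0.90, "style": 0.69},
]

content_scores = [r["content"] for r in results]
style_scores = [r["style"] for r in results]

# Reporting the two axes separately keeps content accuracy and stylistic
# fidelity from masking each other in a single blended number.
print(f"content: mean={mean(content_scores):.3f}, std={pstdev(content_scores):.3f}")
print(f"style:   mean={mean(style_scores):.3f}, std={pstdev(style_scores):.3f}")
```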
End-to-End Open Source
Everything you need is included:
- Full training data
- Model inference pipeline
- Evaluation benchmark
All available under a permissive open-source license.
Get Started
Check out the paper and code:
Paper: arXiv:2507.12841
Code & Models: GitHub (https://github.com/qishisuren123/AnyCap)
Contact
For questions, collaborations, or benchmark submissions, please reach out via the paper's contact email.