Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Abstract
A new prompt-based vision foundation model, Florence-2, is introduced for diverse vision and vision-language tasks, achieving strong zero-shot and fine-tuning capabilities with comprehensive annotations.
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks from simple instructions, a capability that requires handling varying levels of spatial hierarchy and semantic granularity. Florence-2 is designed to take text prompts as task instructions and generate the desired results in text form, whether that is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, built using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
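The "results in text form" point is easiest to see for detection and grounding: regions can be serialized as discrete location tokens that the sequence-to-sequence decoder emits like any other text. Below is an illustrative sketch of that idea, assuming box coordinates are quantized into a fixed number of bins; the exact token format and bin count here are assumptions for illustration, not taken from the released tokenizer.

```python
# Illustrative sketch: serializing a bounding box as text-form location tokens.
# Assumption: coordinates are quantized into num_bins bins per axis and written
# as tokens like "<loc_53>"; the real token format may differ.
def box_to_location_tokens(box, image_size, num_bins=1000):
    """Serialize an (x1, y1, x2, y2) pixel box as a string of location tokens."""
    w, h = image_size
    tokens = []
    for value, extent in zip(box, (w, h, w, h)):
        bin_idx = min(int(value / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_idx}>")
    return "".join(tokens)

# A car box in a 640x480 image becomes a short token string a text decoder can emit.
print("car" + box_to_location_tokens((34, 160, 597, 371), (640, 480)))
# -> car<loc_53><loc_333><loc_932><loc_772>
```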
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (2023)
- Recognize Any Regions (2023)
- SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding (2023)
- OV-VG: A Benchmark for Open-Vocabulary Visual Grounding (2023)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
Great work. Will the data be openly released?
Great work. Is there any model release open for external evaluation?
Appreciate the effort and great work. Will FLD-5B be made available to the public?
Thanks in advance.
Florence-2: The Future of Unified Vision Tasks!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
By Arxflix
For those coming in late! The MSFT team released the following checkpoints:
https://huggingface.co/collections/microsoft/florence-6669f44df0d87d9c3bfb76de
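For reference, here is a minimal inference sketch against those checkpoints. It assumes the usual transformers remote-code loading path; the model id, the "<OD>" task prompt, and the post_process_generation helper follow the public model card rather than anything stated in the paper, so treat them as assumptions that may differ across revisions.

```python
# Minimal sketch (assumptions: model id, task prompt, and processor helpers
# follow the microsoft/Florence-2-large model card; verify against the repo).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Florence-2 routes tasks through text prompts; "<OD>" requests object detection.
task_prompt = "<OD>"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=task_prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)

# Decode the raw text output, then parse it into boxes and labels for this task.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(parsed)  # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}
```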