MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
Dataset: https://huggingface.co/datasets/passing2961/MultiVerse
\n","updatedAt":"2025-10-22T01:37:58.764Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7107563018798828},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2510.16641","authors":[{"_id":"68f7390224c4489363111b4b","user":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","isPro":true,"fullname":"Young-Jun Lee","user":"passing2961","type":"user"},"name":"Young-Jun Lee","status":"claimed_verified","statusLastChangedAt":"2025-10-28T15:38:44.730Z","hidden":false},{"_id":"68f7390224c4489363111b4c","user":{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","isPro":false,"fullname":"Byung-Kwan Lee","user":"BK-Lee","type":"user"},"name":"Byung-Kwan Lee","status":"claimed_verified","statusLastChangedAt":"2025-12-01T09:11:43.414Z","hidden":false},{"_id":"68f7390224c4489363111b4d","name":"Jianshu Zhang","hidden":false},{"_id":"68f7390224c4489363111b4e","name":"Yechan Hwang","hidden":false},{"_id":"68f7390224c4489363111b4f","name":"Byungsoo Ko","hidden":false},{"_id":"68f7390224c4489363111b50","name":"Han-Gyu Kim","hidden":false},{"_id":"68f7390224c4489363111b51","name":"Dongyu Yao","hidden":false},{"_id":"68f7390224c4489363111b52","user":{"_id":"66c014820836dd7a55be3fde","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/5OuqAxEFq1Ny2CHbl9HOm.jpeg","isPro":false,"fullname":"Xuankun Rong","user":"XuankunRong","type":"user"},"name":"Xuankun Rong","status":"claimed_verified","statusLastChangedAt":"2025-11-04T20:28:06.808Z","hidden":false},{"_id":"68f7390224c4489363111b53","name":"Eojin Joo","hidden":false},{"_id":"68f7390224c4489363111b54","name":"Seung-Ho Han","hidden":false},{"_id":"68f7390224c4489363111b55","name":"Bowon Ko","hidden":false},{"_id":"68f7390224c4489363111b56","name":"Ho-Jin Choi","hidden":false}],"publishedAt":"2025-10-18T21:00:12.000Z","submittedOnDailyAt":"2025-10-21T06:17:11.707Z","title":"MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large\n Vision and Language Models","submittedOnDailyBy":{"_id":"6434b6619bd5a84b5dcfa4de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434b6619bd5a84b5dcfa4de/h8Q6kPNjFNc03wmdboHzq.jpeg","isPro":true,"fullname":"Young-Jun Lee","user":"passing2961","type":"user"},"summary":"Vision-and-Language Models (VLMs) have shown impressive capabilities on\nsingle-turn benchmarks, yet real-world applications often demand more intricate\nmulti-turn dialogues. Existing multi-turn datasets (e.g, MMDU, ConvBench) only\npartially capture the breadth and depth of conversational scenarios encountered\nby users. In this work, we introduce MultiVerse, a novel multi-turn\nconversation benchmark featuring 647 dialogues - each averaging four turns -\nderived from a diverse set of 12 popular VLM evaluation benchmarks. 
Code: https://github.com/passing2961/MultiVerse
Organization: KAIST
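The dataset is hosted on the Hugging Face Hub, so it should be loadable with the `datasets` library. Below is a minimal sketch; the field names mentioned in the comments are assumptions about the schema, not taken from the dataset card.

```python
# Minimal sketch: pull MultiVerse from the Hugging Face Hub.
# The field names in the comments are assumptions; inspect the dataset
# card for the actual schema before relying on them.
from datasets import load_dataset

dataset = load_dataset("passing2961/MultiVerse")  # returns a DatasetDict
print(dataset)                      # shows the available splits and columns

split = next(iter(dataset.values()))  # take the first split, whatever it is
example = split[0]
print(example.keys())               # check the real field names here
# Hypothetical fields, if the schema exposes them:
# print(example["dialogue"])          # the multi-turn conversation
# print(example["interaction_goal"])  # the user's goal for the dialogue
```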
AI-generated summary
MultiVerse, a new multi-turn conversation benchmark, evaluates VLMs across diverse tasks and interaction goals, revealing challenges and the importance of in-context learning.
Abstract
Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns, derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse serves as a landscape for evaluating the multi-turn interaction abilities of VLMs.
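The checklist-based evaluation described above uses GPT-4o as an automated judge across 37 aspects. As a rough illustration only, the sketch below scores a single reply against a few hypothetical checklist items via the OpenAI chat API; the prompt wording, the example items, and the 0/1 scoring format are assumptions and do not reproduce the authors' actual protocol.

```python
# Illustrative checklist-style judge, NOT the authors' protocol: the checklist
# items, prompt wording, and 0/1 scoring scheme below are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CHECKLIST = [  # hypothetical subset of the 37 evaluated aspects
    "Perceptual accuracy: does the reply correctly describe the image?",
    "Linguistic clarity: is the reply clear and well formed?",
    "Factual correctness: are all stated facts correct?",
]

def judge_reply(dialogue: str, reply: str) -> dict:
    """Ask GPT-4o to mark each checklist item as 1 (pass) or 0 (fail)."""
    prompt = (
        "You are evaluating an assistant's reply in a multi-turn, "
        "image-grounded conversation.\n\n"
        f"Dialogue so far:\n{dialogue}\n\n"
        f"Reply to evaluate:\n{reply}\n\n"
        "For each numbered checklist item, answer 1 (satisfied) or 0 (not). "
        "Return a JSON object mapping the item number to its score.\n"
        + "\n".join(f"{i}. {item}" for i, item in enumerate(CHECKLIST, start=1))
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

# Example: scores = judge_reply(dialogue_text, model_reply)
#          pass_rate = sum(scores.values()) / len(scores)
```

Since the paper reports that full dialogue context helps smaller or weaker models, a harness built around such a judge would normally feed the model under test the entire conversation history rather than only the latest turn.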