
MM-UPD Bench

[Figure: UPD overview]

Introduction

This paper introduces a novel task for evaluating the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that an LMM truly comprehends the answer. UPD assesses an LMM's ability to withhold its answer when it encounters an unsolvable MCQA problem, verifying whether the model truly understands the question. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases such as answer-lacking or incompatible choice sets and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, even those that demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked.

The corresponding paper can be found here: https://huggingface.co/papers/2403.20331
The code can be found here: https://github.com/AtsuMiyai/UPD

Dataset Details

[Figure: UPD setting overview]

MM-UPD consists of three benchmarks: MM-AAD, MM-IASD, and MM-IVQD.

MM-AAD Bench: MM-AAD Bench is a dataset where the correct answer option for each question is removed. When creating the MM-AAD Bench, we mask the correct options and remove all questions that originally had only two options (since masking would leave a single option). To ensure that no correct answer remains among the options, we also manually remove ambiguous questions. Our MM-AAD Bench has 820 AAD questions over 18 abilities. A toy sketch of the masking step follows.
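
The sketch below illustrates the AAD masking step on toy data (hypothetical structure; the real pipeline also includes the manual ambiguity check mentioned above):

```python
# Illustrative AAD construction on a toy question.
question = {
    "options": {"A": "cat", "B": "dog", "C": "bird", "D": "fish"},
    "answer": "B",
}

# Mask (remove) the correct option so no valid answer remains.
aad_options = {k: v for k, v in question["options"].items()
               if k != question["answer"]}

# Questions that originally had only two options are dropped,
# since masking would leave just one option.
if len(aad_options) >= 2:
    print(aad_options)  # {'A': 'cat', 'C': 'bird', 'D': 'fish'}
```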

MM-IASD Bench: MM-IASD Bench is a dataset where the answer set is completely incompatible with the context specified by the question and the image. To create MM-IASD, we shuffle all questions and answer sets and pair each question with a random answer set. To further ensure incompatibility, after shuffling we manually remove questions whose shuffled answer set is somehow compatible with the question. Our MM-IASD Bench has 919 IASD questions over 18 abilities.

MM-IVQD Bench: MM-IVQD Bench is a dataset where the image and question are incompatible. This is achieved by focusing on specific questions, which are more likely to be incompatible with a randomly picked image. Specifically, we first exclude questions that could be relevant to most images and then shuffle the original image-question pairs. Again, we conduct a manual check to guarantee the incompatibility of the image-question pairs. Our MM-IVQD Bench has 356 IVQD questions over 12 abilities. The shuffle-and-filter step common to MM-IASD and MM-IVQD is sketched below.
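
As an illustration of the shuffle-and-filter construction used for MM-IASD and MM-IVQD (toy data; the real pipeline keeps only pairs that the manual check confirms are incompatible):

```python
import random

# Toy questions and answer sets; in MM-IASD the answer sets are
# shuffled, in MM-IVQD the images are shuffled instead.
questions = ["What color is the car?", "How many people are shown?"]
answer_sets = [["red", "blue", "green"], ["1", "2", "3"]]

# Pair each question with a randomly drawn answer set.
shuffled = answer_sets[:]
random.shuffle(shuffled)
candidates = list(zip(questions, shuffled))

# A pairing like ("What color is the car?", ["1", "2", "3"]) is
# incompatible and kept; compatible pairings are removed manually.
print(candidates)
```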

Each file under ./data is explained below:

    1. mm<aad/iasd/ivqd>_20240303_base.tsv: UPD and Standard questions for the base setting (a mixture of 3. and 4.)
    2. mm<aad/iasd/ivqd>_20240303_option.tsv: UPD and Standard questions for the additional-option setting (a mixture of 5. and 6.)
    3. mm<aad/iasd/ivqd>_<aad/iasd/ivqd>_20240303_base.tsv: UPD questions for the base setting
    4. mm<aad/iasd/ivqd>_standard_20240303_base.tsv: Standard questions for the base setting
    5. mm<aad/iasd/ivqd>_<aad/iasd/ivqd>_20240303_option.tsv: UPD questions for the additional-option setting
    6. mm<aad/iasd/ivqd>_standard_20240303_option.tsv: Standard questions for the additional-option setting

For the additional-instruction setting and instruction tuning, you can use the files for the base setting.
Note that the line count of each tsv file also includes the CircularEval passes (for example, four copies of a single question if it has four choices), so the number of lines is roughly 4x the number of questions.
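
As a quick local sanity check, the following sketch loads one of these tsv files with pandas (the file name follows the pattern above; everything beyond counting rows is an assumption about the file layout):

```python
import pandas as pd

# Inspect one UPD tsv locally; this is illustrative, not part of
# the official evaluation code.
df = pd.read_csv("data/mmaad_aad_20240303_base.tsv", sep="\t")

# Each question appears once per CircularEval pass (one copy per
# choice ordering), so for 4-choice questions:
print(f"rows including CircularEval passes: {len(df)}")
print(f"approximate number of questions:    {len(df) // 4}")
```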

How to Download

Please load the dataset via:

  • load_dataset("MM-UPD/MM-UPD", config_name)

The config_name is <mmaad/mmiasd/mmivqd>_<base/option> (e.g., mmivqd_base).
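
For example, a minimal loading snippet (trust_remote_code=True may be needed on recent versions of the datasets library, since this repo uses a loading script):

```python
from datasets import load_dataset

# Load the MM-IVQD base setting; swap the config name for other
# benchmark/setting combinations (e.g., "mmaad_base", "mmiasd_option").
dataset = load_dataset("MM-UPD/MM-UPD", "mmivqd_base", trust_remote_code=True)
print(dataset)
```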

Dataset Sources

For the images of MM-UPD Bench, we use the data from MMBench (https://github.com/open-compass/MMBench) following its license (https://github.com/open-compass/MMBench/blob/main/LICENSE).

Citation

If you find our work interesting or use our code/models, please consider citing:

@inproceedings{miyai2025unsolvable,
  title={Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models},
  author={Miyai, Atsuyuki and Yang, Jingkang and Zhang, Jingyang and Ming, Yifei and Yu, Qing and Irie, Go and Li, Yixuan and Li, Hai and Liu, Ziwei and Aizawa, Kiyoharu},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2025}
}