Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Is Your Multimodal Language Model Oversensitive to Safe Queries?
Authors: Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2) Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model's responses. (3) Different types of stimuli tend to cause errors at specific stages (perception, intent reasoning, and safety judgment) in the response process of MLLMs. |
| Researcher Affiliation | Academia | Xirui Li*, University of California, LA; Hengguang Zhou*, University of California, LA; Ruochen Wang, University of California, LA; Tianyi Zhou, University of Maryland; Minhao Cheng, Pennsylvania State University; Cho-Jui Hsieh, University of California, LA |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures (e.g., Figure 7 outlines a prompt template), but no structured pseudocode or algorithm blocks are explicitly labeled or presented. |
| Open Source Code | Yes | We make our project available at https://turningpoint-ai.github.io/MOSSBench/. |
| Open Datasets | Yes | To systematically evaluate MLLMs' oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). ... we developed the first Multimodal OverSenSitivity Benchmark (MOSSBench). This benchmark comprises 544 high-quality image-text pairs following the identified three visual stimuli and formatted for Visual-Question-Answering. |
| Dataset Splits | No | The paper introduces a new benchmark, MOSSBench, used for evaluating existing MLLMs. It does not describe standard training, validation, and test splits for the benchmark data itself, as it is primarily used as an evaluation set for pre-existing models. While it mentions 'Samples contrasting' to construct harmful samples, these are not standard train/test/validation splits for model training. |
| Hardware Specification | Yes | All experiments with open-sourced models are conducted using a NVIDIA A6000 GPU. |
| Software Dependencies | Yes | We utilize the gpt-4-turbo-2024-04-09 for its good safety alignment on our samples. |
| Experiment Setup | Yes | Hyperparameters: We select model hyperparameter settings that ensure deterministic output to encourage reproducibility. Table 5 is the full documentation of hyperparameters we used for MLLMs. For the Gemini and Claude 3 models, we set the temperature to 0 and top-k to 1; for the GPT-4 models, we also set the temperature to 0 and specify a random seed of 42; for open-source models, we also set the temperature to 0, top-k to 1, and random seed to 42. (Table 5 lists: do_sample=False, num_beams=5, max_length=1000, min_length=10, top_p=1, repetition_penalty=1.5, length_penalty=1.0, temperature=0) |
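The deterministic decoding settings quoted above can be collected into a single generation config. A minimal sketch in Python, assuming a Hugging Face-style `generate` keyword interface; the helper function name is hypothetical and the values are taken from Table 5 of the paper:

```python
# Sketch of the deterministic generation settings reported in Table 5.
# `deterministic_generation_config` is a hypothetical helper, not the
# authors' code; the keyword names follow the Hugging Face convention.

def deterministic_generation_config(seed: int = 42) -> dict:
    """Return generation kwargs that make open-source MLLM output deterministic."""
    return {
        "do_sample": False,         # beam search, no sampling
        "num_beams": 5,
        "max_length": 1000,
        "min_length": 10,
        "top_p": 1,
        "repetition_penalty": 1.5,
        "length_penalty": 1.0,
        "temperature": 0,
        # For API models (GPT-4, Gemini, Claude 3) the paper likewise
        # sets temperature=0, top-k=1, and (where supported) seed=42.
        "seed": seed,
    }

config = deterministic_generation_config()
```

With `do_sample=False` the seed and temperature are largely redundant for open-source models, but fixing them as well guards against backend defaults that silently re-enable sampling.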