Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Is Your Multimodal Language Model Oversensitive to Safe Queries?
Authors: Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2) Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model's responses. (3) Different types of stimuli tend to cause errors at specific stages (perception, intent reasoning, and safety judgment) in the response process of MLLMs. |
| Researcher Affiliation | Academia | Xirui Li*, University of California, LA; Hengguang Zhou*, University of California, LA; Ruochen Wang, University of California, LA; Tianyi Zhou, University of Maryland; Minhao Cheng, Pennsylvania State University; Cho-Jui Hsieh, University of California, LA |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures (e.g., Figure 7 outlines a prompt template), but no structured pseudocode or algorithm blocks are explicitly labeled or presented. |
| Open Source Code | Yes | We make our project available at https://turningpoint-ai.github.io/MOSSBench/. |
| Open Datasets | Yes | To systematically evaluate MLLMs' oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). ... we developed the first Multimodal OverSenSitivity Benchmark (MOSSBench). This benchmark comprises 544 high-quality image-text pairs following the identified three visual stimuli and formatted for Visual-Question-Answering. |
| Dataset Splits | No | The paper introduces a new benchmark, MOSSBench, used for evaluating existing MLLMs. It does not describe standard training, validation, and test splits for the benchmark data itself, as it is primarily used as an evaluation set for pre-existing models. While it mentions 'Samples contrasting' to construct harmful samples, these are not standard train/test/validation splits for model training. |
| Hardware Specification | Yes | All experiments with open-sourced models are conducted using a NVIDIA A6000 GPU. |
| Software Dependencies | Yes | We utilize the gpt-4-turbo-2024-04-09 for its good safety alignment on our samples. |
| Experiment Setup | Yes | Hyperparameters: We select model hyperparameter settings that ensure deterministic output to encourage reproducibility. Table 5 is the full documentation of hyperparameters we used for MLLMs. For the Gemini and Claude 3 models, we set the temperature to 0 and top-k to 1; for the GPT-4 models, we also set the temperature to 0 and specify a random seed of 42; for open-source models, we also set the temperature to 0, top-k to 1, and random seed to 42. (Table 5 lists: do_sample=False, num_beams=5, max_length=1000, min_length=10, top_p=1, repetition_penalty=1.5, length_penalty=1.0, temperature=0) |
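The deterministic decoding settings quoted above can be collected into a single generation config. A minimal sketch in Python, assuming a Hugging Face-style `generate` keyword interface; the helper function name is hypothetical and the values are taken from Table 5 of the paper:

```python
# Sketch of the deterministic generation settings reported in Table 5.
# `deterministic_generation_config` is a hypothetical helper, not the
# authors' code; the keyword names follow the Hugging Face convention.

def deterministic_generation_config(seed: int = 42) -> dict:
    """Return generation kwargs that make open-source MLLM output deterministic."""
    return {
        "do_sample": False,         # beam search, no sampling
        "num_beams": 5,
        "max_length": 1000,
        "min_length": 10,
        "top_p": 1,
        "repetition_penalty": 1.5,
        "length_penalty": 1.0,
        "temperature": 0,
        # For API models (GPT-4, Gemini, Claude 3) the paper likewise
        # sets temperature=0, top-k=1, and (where supported) seed=42.
        "seed": seed,
    }

config = deterministic_generation_config()
```

With `do_sample=False` the seed and temperature are largely redundant for open-source models, but fixing them as well guards against backend defaults that silently re-enable sampling.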