Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multimodal Situational Safety
Authors: Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, Xin Wang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,960 language query-image pairs; in half of them the image context is safe, and in the other half it is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and struggle to tackle these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines to coordinately solve safety challenges, which shows consistent improvement in safety over the original MLLM response. |
| Researcher Affiliation | Academia | Kaiwen Zhou1, Chengzhi Liu1, Xuandong Zhao2, Anderson Compalas1, Dawn Song2, Xin Eric Wang1 — 1University of California, Santa Cruz; 2University of California, Berkeley |
| Pseudocode | No | The paper describes workflows for multi-agent systems using diagrams and textual descriptions of agent roles, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data: mssbench.github.io. |
| Open Datasets | Yes | To comprehensively evaluate the current MLLMs' situational safety performance, we introduce a Multimodal Situational Safety benchmark (MSSBench) with 1,960 language-image pairs. ... Initially, we randomly select 5,000 images I = {i1, ..., iN} from the COCO dataset (Lin et al., 2014) for each situational safety category, considering them as safe images. ... Code and data: mssbench.github.io. |
| Dataset Splits | Yes | Our dataset is a balanced dataset, with half of the data containing safe situations and half containing unsafe situations. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. It mentions evaluating different MLLMs and their versions, but not the computational infrastructure. |
| Software Dependencies | No | The paper lists various MLLMs (e.g., LLaVA-1.6, MiniGPT4-v2, Qwen-VL) and mentions using GPT-4o for categorization, but it does not specify version numbers for general programming languages or libraries (e.g., Python, PyTorch, TensorFlow) used in their implementation. |
| Experiment Setup | No | The paper describes evaluation settings like 'instruction following setting', 'query classification', and 'intent classification' with corresponding prompts. It also mentions using 'default settings' for open-source MLLMs. However, it does not provide specific hyperparameters (e.g., learning rate, batch size, number of epochs) or training configurations for the models or the multi-agent system described. |
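The benchmark's central idea — scoring the same query separately under safe and unsafe visual contexts — can be sketched in a few lines. The record fields (`query`, `image_context`, `model_safe`) below are assumptions for illustration, not the benchmark's actual schema; see mssbench.github.io for the real data and evaluation framework.

```python
# Hypothetical sketch of per-context scoring on MSSBench-style data.
# Field names and the correctness criterion are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Pair:
    query: str          # language query shared across contexts
    image_context: str  # "safe" or "unsafe" visual situation
    model_safe: bool    # whether the model's response was judged appropriate

def situational_accuracy(pairs):
    """Accuracy per context: a response counts as correct when the model
    answers helpfully in a safe context and refuses or warns in an unsafe
    one (both collapsed here into the model_safe judgment)."""
    by_ctx = {"safe": [0, 0], "unsafe": [0, 0]}
    for p in pairs:
        correct, total = by_ctx[p.image_context]
        by_ctx[p.image_context] = [correct + p.model_safe, total + 1]
    return {ctx: c / t for ctx, (c, t) in by_ctx.items() if t}

# Toy balanced example mirroring the 50/50 safe-unsafe split.
pairs = [
    Pair("how do I sharpen this?", "safe", True),    # e.g., kitchen-knife photo
    Pair("how do I sharpen this?", "unsafe", False), # e.g., threatening scene
]
print(situational_accuracy(pairs))  # {'safe': 1.0, 'unsafe': 0.0}
```

Reporting the two contexts separately, rather than one pooled accuracy, exposes the failure mode the paper highlights: a model can look safe on aggregate while never adapting its answer to the unsafe situation.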