Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Role Bias in Diffusion Models: Diagnosing and Mitigating through Intermediate Decomposition

Authors: Sina Malakouti, Adriana Kovashka

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We introduce RoleBench, a benchmark focused on evaluating compositional generalization in action-based relations (e.g., mouse chasing cat). We show that state-of-the-art T2I models and compositional generation methods consistently default to frequent reversed relations (i.e., cat chasing mouse), a phenomenon we call role collapse. Related works attribute this to the model's architectural limitation or underrepresentation in the data. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., mouse chasing boy), suggesting that this limitation is also due to the presence of frequent counterparts rather than just the absence of rare compositions. Motivated by this, we hypothesize that directional decomposition can gradually mitigate role collapse. We test this via ReBind, a lightweight framework that teaches role bindings using carefully selected active/passive intermediate compositions. Experiments suggest that intermediate compositions through simple fine-tuning can significantly reduce role collapse, with humans preferring ReBind more than 78% compared to state-of-the-art methods.
Researcher Affiliation	Academia	Sina Malakouti, University of Pittsburgh, Pittsburgh, PA, EMAIL; Adriana Kovashka, University of Pittsburgh, Pittsburgh, PA, EMAIL
Pseudocode	No	The paper describes its methodology using textual descriptions and mathematical equations, such as the objective function $L(\theta) = \mathbb{E}_{(x_0, p) \sim D,\, \epsilon,\, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, p, t) \rVert_2^2\right]$ (Eq. 1), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	No	Justification: a large part of the benchmark is based on LLMs; prompts are provided in the Appendix (Sec. F), ensuring reproducibility. The training script is standard supervised training. The details of the training are provided. We'll make the code public.
Open Datasets	Yes	We further evaluate whether fine-tuning on relation-based intermediates affects the general image quality of the model. To this end, we generate 5,000 images from COCO 2017 validation prompts [Lin et al., 2014, Chen et al., 2015] using the same configurations for both SDXL and ReBind.
Dataset Splits	Yes	We generate 5,000 images from COCO 2017 validation prompts [Lin et al., 2014, Chen et al., 2015] using the same configurations for both SDXL and ReBind.
Hardware Specification	Yes	We fine-tune models using LoRA (rank 32) for 1,000 steps on an L40S GPU with a 1e-4 learning rate and bf16 precision. For each rare triplet, we generate 2–3 active/passive triplets using GPT-4o. We fix the seed to 24 for training and ablations, but we do not use a seed for generating RoleBench to ensure output variability.
Software Dependencies	No	For fine-tuning, we use the official Hugging Face LoRA training pipeline. All experiments are conducted on a single NVIDIA L40S GPU for 1000 training steps. We adopt LoRA with rank 32, bfloat16 precision, and use the Adam optimizer with a gradient accumulation step size of 4 (with batch size 1 per step). The text mentions software components like "Hugging Face LoRA training pipeline" and "Adam optimizer" but does not specify their version numbers, nor the versions of underlying libraries like Python or PyTorch.
Experiment Setup	Yes	We fine-tune models using LoRA (rank 32) for 1,000 steps on an L40S GPU with a 1e-4 learning rate and bf16 precision. All experiments are conducted on a single NVIDIA L40S GPU for 1000 training steps. We adopt LoRA with rank 32, bfloat16 precision, and use the Adam optimizer with a gradient accumulation step size of 4 (with batch size 1 per step). Each intermediate prompt is used to generate 20 training images. A fixed seed (42) is used for all fine-tuning experiments to ensure reproducibility.
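
For readers who want the reported fine-tuning settings in one place, the following is a minimal sketch that collects the values quoted in the table above into a plain Python dataclass. The class name, field names, and helper methods are hypothetical and are not taken from the paper or from the Hugging Face training script; only the numeric values come from the quoted text. The seed is left out because the quoted passages report two different values (24 and 42). The sample count assumes that "steps" means optimizer updates, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the settings quoted in the table."""

    lora_rank: int = 32                       # "LoRA (rank 32)"
    max_train_steps: int = 1000               # "1,000 steps"
    learning_rate: float = 1e-4               # "1e-4 learning rate"
    mixed_precision: str = "bf16"             # "bf16 precision"
    optimizer: str = "adam"                   # "Adam optimizer"
    train_batch_size: int = 1                 # "batch size 1 per step"
    gradient_accumulation_steps: int = 4      # "gradient accumulation step size of 4"
    images_per_intermediate_prompt: int = 20  # "20 training images"
    # Seed intentionally omitted: the quoted text gives both 24 and 42.

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer update.
        return self.train_batch_size * self.gradient_accumulation_steps

    def samples_seen(self) -> int:
        # Assumption: max_train_steps counts optimizer updates.
        return self.max_train_steps * self.effective_batch_size


cfg = FineTuneConfig()
print(cfg.effective_batch_size, cfg.samples_seen())
```

Under these assumptions, each update averages gradients over 4 images, so a 1,000-step run would draw 4,000 training samples in total.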