Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On Fairness of Unified Multimodal Large Language Model for Image Generation

Authors: Ming Liu, Hao Chen, Jindong Wang, Liwen Wang, Bhiksha Raj, Wensheng Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we benchmark several state-of-the-art U-MLLMs and show that they exhibit significant gender and race biases in the generated outputs. Extensive experiments show that our approach significantly reduces demographic bias while preserving semantic fidelity and image quality.
Researcher Affiliation	Academia	¹Iowa State University, ²Carnegie Mellon University, ³William & Mary EMAIL
Pseudocode	Yes	Algorithm 1: Balanced Preference Optimization. Input: U-MLLM with parameters θ_0; SFT dataset D_SFT = {(x_i, z_i)} of prompts x_i and image tokens z_i; balanced dataset D_bal = {(x_j, y_{j,1}, ..., y_{j,K})}, each x_j with multiple demographic variants; tradeoff parameter λ; total training epochs N_1, N_2. Output: debiased model parameters θ.
Open Source Code	No	We plan to release all code and datasets in a public repository upon acceptance, including detailed setup instructions and scripts for reproducing our main results.
Open Datasets	Yes	We select a balanced subset of face images from FairFace [19], spanning multiple genders and races, and extract their latent embeddings via the vision encoder. For bias evaluation, we use a set of prompts comprising 50 test examples and 1,000 training examples and adopt metrics from fairness benchmark [34]. We address this by leveraging a pretrained text-to-image diffusion model, FLUX [20], to synthesize a demographically balanced image set.
Dataset Splits	Yes	For bias evaluation, we use a set of prompts comprising 50 test examples and 1,000 training examples and adopt metrics from fairness benchmark [34]. In total, we curated 300K images (1,000 prompts, 6 demographic groups, 50 images per prompt) to form D_bal.
Hardware Specification	Yes	Empirically, based on training logs using two A100 GPUs (40 GB), the fine-tuning stage takes 23,020 s for gender-only, 23,020 s for race-only, and 23,193 s for mixed race-gender configurations, while the BPO stage takes roughly 2,365 s, 2,375 s, and 2,988 s, respectively.
Software Dependencies	No	The paper mentions using the LoRA [16] method for finetuning but does not provide specific software names with version numbers for libraries or frameworks like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Base Learning Rate: We start with a fixed learning rate (for example, 1 × 10^-4) for fine-tuning. Batch Size: Ranges from 8 to 32, depending on the setup. Number of Steps: For the first stage, we fine-tune for up to 10 epochs, checking bias and quality after the training. For the second stage, the epoch is chosen to be 1 to 2. We use the LoRA [16] method for finetuning; the rank is 32 for all experiments.
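
The dataset size and training times quoted in the table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function and constant names are hypothetical, "50 images per prompt" is assumed to mean 50 images per prompt per demographic group (the reading under which the quoted 300K total holds), and the GPU-hour figure assumes both A100s are busy for the full duration of each stage.

```python
# Figures quoted in the table above; all names here are hypothetical.
PROMPTS = 1000
DEMOGRAPHIC_GROUPS = 6
IMAGES_PER_PROMPT_PER_GROUP = 50  # assumption: "per prompt" means per prompt per group
NUM_GPUS = 2  # two A100 GPUs (40 GB)

# Seconds per stage: (fine-tuning, BPO), as reported from the training logs.
STAGE_SECONDS = {
    "gender-only": (23020, 2365),
    "race-only": (23020, 2375),
    "mixed race-gender": (23193, 2988),
}


def balanced_set_size(prompts, groups, images_each):
    """Number of images in the balanced set under the stated assumption."""
    return prompts * groups * images_each


def total_seconds(config):
    """Wall-clock seconds for fine-tuning plus BPO for one configuration."""
    fine_tune, bpo = STAGE_SECONDS[config]
    return fine_tune + bpo


def gpu_hours(config, num_gpus=NUM_GPUS):
    """GPU-hours, assuming every GPU is in use for the whole run."""
    return total_seconds(config) * num_gpus / 3600


if __name__ == "__main__":
    size = balanced_set_size(PROMPTS, DEMOGRAPHIC_GROUPS, IMAGES_PER_PROMPT_PER_GROUP)
    print(f"Balanced set size: {size:,} images")
    for name in STAGE_SECONDS:
        print(f"{name}: {total_seconds(name):,} s, {gpu_hours(name):.2f} GPU-hours")
```

Under these assumptions each configuration comes to roughly 7 hours of wall-clock time, with the BPO stage accounting for about a tenth of the total.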