Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CAMILA: Context-Aware Masking for Image Editing with Language Alignment

Authors: Hyunseung Kim, Chiho Choi, Srikanth Malla, Sai Padmanabhan, Saurabh Bagchi, Joon Hee Choi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity. Section 5, titled "Evaluation", contains quantitative and qualitative results, including metrics, baselines, and ablation studies, which are characteristic of experimental research.
Researcher Affiliation	Collaboration	The authors are affiliated with "Samsung Semiconductor, USA" (an industry entity) and "Purdue University" (an academic institution), indicating a collaboration between industry and academia.
Pseudocode	No	The paper describes the methods and architecture in prose and through diagrams (Figure 2, Figure 3) within the main body, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, step-by-step procedures.
Open Source Code	Yes	Our source code is available at https://github.com/hk-repo/CAMILA.
Open Datasets	Yes	For evaluating single instruction tasks, we use the Magic Brush [47] dataset, which covers both single-turn and multi-turn scenarios as detailed in Section 6, along with the EMU [40] dataset.
Dataset Splits	No	The paper mentions using Magic Brush [47] and EMU [40] datasets, and creating new datasets for multi-instruction and context-aware instruction editing. However, it does not explicitly provide specific percentages, sample counts, or methodology for training, validation, or test splits for any of these datasets in the provided text.
Hardware Specification	No	The paper does not explicitly specify the hardware (e.g., GPU models, CPU types, memory amounts, or cloud instance types) used for running the experiments. While it mentions inference time is reported in Section D.2, this section is not provided, and there are no direct hardware specifications in the accessible text.
Software Dependencies	No	The paper refers to various models and frameworks such as 'Stable Diffusion [38]', 'MLLM [27]', and 'CLIP [35]' but does not provide specific version numbers for software dependencies like programming languages, libraries, or development environments (e.g., Python version, PyTorch version, CUDA version).
Experiment Setup	Yes	The training of our MLLM-based approach is optimized with four primary loss components, each designed to target a specific aspect of model performance for accurate token classification, alignment, and mask generation. The total loss $L_{\text{main}}$ is formulated as follows: $L_{\text{main}} = \lambda_1 L_{\text{token CE}} + \lambda_2 L_{\text{broadcast CE}} + \lambda_3 L_{\text{dice}} + \lambda_4 L_{\text{BCE}}$, where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are hyperparameters that balance the influence of each loss component. ... To efficiently fine-tune the pre-trained MLLM while preserving its learned knowledge, we adopt the Low-Rank Adaptation technique [15]. In our training, we freeze the vision backbone and text encoder of the MLLM, while the remaining parts of the model are fine-tuned. ... The updated loss $L_{\text{updated}}$ is defined as: $L_{\text{main}} + \lambda_5 L_{\text{MSE}}$, where $L_{\text{MSE}}$ is the MSE loss between the predicted and oracle CLIP-T score, and $\lambda_5$ is a hyperparameter controlling the weight of the CLIP-T score loss.
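
The weighted loss quoted in the Experiment Setup row can be illustrated with a minimal sketch. It only shows how the five terms combine; it is not the authors' implementation. The names `LossWeights`, `main_loss`, and `updated_loss` are hypothetical, and the weight values are placeholders, since the quoted text does not report the values of the λ hyperparameters.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    """Hypothetical container for the lambda hyperparameters.

    The defaults are placeholders; the quoted text does not give actual values.
    """
    token_ce: float = 1.0      # lambda_1, weight of the token CE loss
    broadcast_ce: float = 1.0  # lambda_2, weight of the broadcast CE loss
    dice: float = 1.0          # lambda_3, weight of the dice loss
    bce: float = 1.0           # lambda_4, weight of the BCE loss
    clip_t_mse: float = 1.0    # lambda_5, weight of the CLIP-T score MSE loss


def main_loss(token_ce, broadcast_ce, dice, bce, w):
    """Weighted sum of the four primary loss components (L_main)."""
    return (
        w.token_ce * token_ce
        + w.broadcast_ce * broadcast_ce
        + w.dice * dice
        + w.bce * bce
    )


def updated_loss(l_main, clip_t_mse, w):
    """L_main plus the weighted CLIP-T score MSE term (L_updated)."""
    return l_main + w.clip_t_mse * clip_t_mse


# Example with made-up per-batch loss values and made-up weights.
weights = LossWeights(token_ce=1.0, broadcast_ce=0.5, dice=2.0, bce=1.0, clip_t_mse=0.1)
l_main = main_loss(token_ce=0.2, broadcast_ce=0.4, dice=0.1, bce=0.3, w=weights)
l_updated = updated_loss(l_main, clip_t_mse=0.5, w=weights)
```

The functions accept plain floats here, but the same arithmetic would apply to scalar tensors in a training framework, where the individual loss terms would come from the model's outputs.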