Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback

Authors: Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.
Researcher Affiliation	Academia	Janet Wang Yunbei Zhang Zhengming Ding Jihun Hamm Tulane University EMAIL
Pseudocode	No	The paper describes the methodology and various components in detailed prose and mathematical formulations within sections like "3.1 Preliminaries", "3.2 Preliminary Diffusion Models Fine-tuning", "3.3 Expert Feedback Curation", and "3.4 Finetuning with Expert Feedback", but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our implementation detail and code is available at https://github.com/janet-sw/MAGIC.git.
Open Datasets	Yes	Following prior work [58], we use the Fitzpatrick17k dataset to evaluate our synthetic augmentation pipeline [23]. Although there are other datasets of clinical photos (e.g., SCIN [62] and DDI [11])... Our MAGIC framework's effectiveness is further validated on the SCIN dataset... To quickly evaluate the cross-dataset generalizability of our method, we identified four overlapping classes between the hospital-grade PAD-UFES-20 dataset and Fitzpatrick17k subset...
Dataset Splits	Yes	For classifier training, we randomly split the dataset into training and hold-out sets at a 50/50 ratio, resulting in 3,100 training and 3,100 test images.
Hardware Specification	Yes	All experiments are conducted five rounds on RTX 6000 Ada GPUs.
Software Dependencies	Yes	We utilize Stable Diffusion v2-1 [44] for image generation. For classification tasks, we employ ResNet18 [24] and DINOv2 [41] as backbone architectures... To automate this evaluation, we instructed an MLLM to analyze each synthesized image based on the target condition's checklist and return a 5-dimensional binary score list, where each dimension corresponds to a criterion's satisfaction (see Appendix C for instruction details). To accommodate both reward-based and preference-based alignment strategies, we generate a pair of images from each text prompt and submit each single image to the MLLM for this assessment. Thus, the MLLM's score list for each image in a pair individually stands as a sample for RFT, while the pair of score lists can be used for DPO. Examples of this MLLM assessment using OpenAI's GPT-4o are illustrated in Fig. 3...
Experiment Setup	Yes	For textual inversion, we learn the text embedding for each skin condition through various prompts... The AdamW optimizer is used with a learning rate of 5 * 10^-4. For LoRA, the rank r is set to 32, and the learning rate is 5 * 10^-6 for AdamW optimizer... We conduct DPO fine-tuning for 128 iterations and for each iteration, 8 pairs (16 images) will be sampled. The denoise strength γ is set to 0.3. The DPO loss will be computed with the feedback. We utilize AdamW optimizer with a learning rate of 0.0001. We utilize the Adam optimizer with a learning rate of 0.01 and a step learning rate scheduler that reduces the learning rate to 0.1 of its previous value every 50 epochs. The classifier is trained for 200 epochs to ensure stable results.
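
The step learning rate schedule quoted in the Experiment Setup row can be worked through with a short sketch. It assumes the quoted values (initial rate 0.01, multiplied by 0.1 every 50 epochs, 200 epochs in total) all describe the classifier training, and that epochs are counted from 0. The function name `step_lr` and its parameter names are hypothetical and are not taken from the paper or its repository.

```python
def step_lr(epoch, base_lr=0.01, step_size=50, gamma=0.1):
    """Learning rate in effect during a 0-indexed epoch under a step schedule.

    Assumed defaults mirror the quoted setup: start at 0.01 and multiply
    by 0.1 once every 50 epochs.
    """
    # Integer division counts how many decay steps have already happened.
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # Print the rate at the start and end of each 50-epoch stage of a 200-epoch run.
    for epoch in (0, 49, 50, 99, 100, 149, 150, 199):
        print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.0e}")
```

Under these assumptions the 200-epoch run passes through four stages, with rates of roughly 1e-2, 1e-3, 1e-4, and 1e-5.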