Aligning Synthetic Medical Images with Clinical Knowledge using Human Feedback

Authors: Shenghuan Sun, Greg Goldgof, Atul Butte, Ahmed M. Alaa

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings suggest that incorporating pathologist feedback significantly enhances the quality of synthetic images in terms of all existing quality metrics such as fidelity, accuracy of downstream predictive models, and clinical plausibility as evaluated by experts. Additionally, it also improves qualities that are not directly addressed in the pathologist evaluation, such as the diversity of synthetic samples.
Researcher Affiliation | Academia | Shenghuan Sun (University of California, San Francisco) shenghuan.sun@ucsf.edu; Gregory M. Goldgof (Memorial Sloan Kettering Cancer Center) goldgofg@mskcc.org; Atul Butte (University of California, San Francisco) atul.butte@ucsf.edu; Ahmed M. Alaa (UC Berkeley and UCSF) amalaa@berkeley.edu
Pseudocode | Yes | Algorithm 1 Training the reward model ... Algorithm 2 Pretraining the conditional diffusion model for generating synthetic images ... Algorithm 3 Finetuning the conditional diffusion model using pathologist feedback ... Algorithm 4 Incorporating new clinical concepts into the model (an illustrative reward-model sketch follows the table)
Open Source Code | No | The paper mentions using a "public repository (https://github.com/openai/improved-diffusion.git)" for their finetuning pipeline, but this is a third-party tool they utilized, not their own open-sourced code for the specific methodology described in the paper.
Open Datasets | No | In all experiments, we used a dataset of hematopathologist consensus-annotated single-cell images extracted from bone marrow aspirate (BMA) whole slide images. The images were obtained from the clinical archives of an academic medical center.
Dataset Splits | No | Training was conducted using 128 images per cell type, with 32 images per cell type held out for testing and evaluating all performance metrics. The paper only specifies train and test splits, without explicitly defining a separate validation split or its size. (A per-cell-type split sketch follows the table.)
Hardware Specification | Yes | The model is trained in half-precision on 2 × 24 GB NVIDIA GPUs, with a per-GPU batch size of 16, resulting in a total batch size of 32.
Software Dependencies | No | The paper mentions using a public repository, the Adam optimizer [49], and the ResNeXt-50 architecture, but it does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch version, Python version).
Experiment Setup | Yes | We used a learning rate of 10^-4, and an exponential moving average over parameters with a rate of 0.9999. ... The model is trained in half-precision on 2 × 24 GB NVIDIA GPUs, with a per-GPU batch size of 16, resulting in a total batch size of 32. (A training-configuration sketch using these stated hyperparameters follows the table.)
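
For the Pseudocode row: Algorithm 1 trains a reward model on pathologist feedback, and the Software Dependencies row notes a ResNeXt-50 architecture. Below is a minimal illustrative sketch of such a reward model in PyTorch; it is not the authors' code, and the (image, cell type, plausibility label) triple format, the binary label encoding, and the training hyperparameters are assumptions.

```python
# Illustrative sketch only: a reward model trained on pathologist plausibility labels,
# loosely following the role of Algorithm 1. The dataset format, label encoding, and
# hyperparameters are assumptions, not the authors' released code.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision.models import resnext50_32x4d

def train_reward_model(feedback_dataset, epochs=10, lr=1e-4, device="cuda"):
    """feedback_dataset yields (image, cell_type, plausibility_label) triples, where
    plausibility_label is assumed to be 1.0 if the pathologist judged the image
    clinically plausible for the stated cell type and 0.0 otherwise."""
    model = resnext50_32x4d(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 1)  # scalar reward head
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    loader = DataLoader(feedback_dataset, batch_size=32, shuffle=True)

    for _ in range(epochs):
        for images, _cell_types, labels in loader:
            images, labels = images.to(device), labels.float().to(device)
            rewards = model(images).squeeze(1)  # one reward score per image
            loss = loss_fn(rewards, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

The trained model's scalar output can then be used as the feedback signal when finetuning the conditional diffusion model (Algorithm 3), though the exact finetuning objective is not reproduced here.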
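
For the Dataset Splits row: the reported split is 128 training and 32 held-out images per cell type. The sketch below shows one way to build such a per-cell-type split; the in-memory (image, cell_type) representation and the fixed random seed are assumptions, since the paper does not describe how its split was constructed.

```python
# Illustrative per-class split mirroring the reported counts (128 training and
# 32 held-out images per cell type). Input format and seed are assumptions.
import random
from collections import defaultdict

def split_per_cell_type(samples, n_train=128, n_test=32, seed=0):
    """samples: iterable of (image, cell_type) pairs."""
    by_type = defaultdict(list)
    for image, cell_type in samples:
        by_type[cell_type].append(image)

    rng = random.Random(seed)
    train, test = [], []
    for cell_type, images in by_type.items():
        if len(images) < n_train + n_test:
            raise ValueError(f"Not enough images for cell type {cell_type!r}")
        rng.shuffle(images)
        train += [(img, cell_type) for img in images[:n_train]]
        test += [(img, cell_type) for img in images[n_train:n_train + n_test]]
    return train, test
```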
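
For the Hardware Specification and Experiment Setup rows: the reported configuration is Adam with a learning rate of 10^-4, an exponential moving average over parameters with rate 0.9999, half-precision training, and a per-GPU batch size of 16 across two GPUs. The sketch below wires those stated hyperparameters into a generic PyTorch mixed-precision step; it is not the authors' pipeline (they built on the improved-diffusion repository), and the `loss_fn` placeholder and single-process structure are assumptions.

```python
# Generic sketch of the reported training configuration (Adam, lr 1e-4, EMA 0.9999,
# half-precision, per-GPU batch size 16). Not the authors' code.
import copy
import torch

def make_optimizer_and_ema(model, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    ema_model = copy.deepcopy(model).eval()
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return optimizer, ema_model

@torch.no_grad()
def update_ema(ema_model, model, rate=0.9999):
    # Exponential moving average over parameters, as reported in the paper.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(rate).add_(p, alpha=1.0 - rate)

def train_step(model, batch, loss_fn, optimizer, ema_model, scaler):
    # Half-precision forward/backward via AMP; the per-GPU batch would hold 16 samples,
    # giving a total batch size of 32 across 2 GPUs under DistributedDataParallel
    # (the multi-GPU wrapping is not shown here).
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model, batch)  # placeholder for the diffusion training objective
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    update_ema(ema_model, model, rate=0.9999)
    return loss.detach()
```

Here `scaler` would be a `torch.cuda.amp.GradScaler()`; in the paper the training itself was run through the improved-diffusion repository rather than a hand-rolled loop like this.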