Aligning Synthetic Medical Images with Clinical Knowledge using Human Feedback
Authors: Shenghuan Sun, Gregory M. Goldgof, Atul Butte, Ahmed M. Alaa
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings suggest that incorporating pathologist feedback significantly enhances the quality of synthetic images in terms of all existing quality metrics such as fidelity, accuracy of downstream predictive models, and clinical plausibility as evaluated by experts. Additionally, it also improves qualities that are not directly addressed in the pathologist evaluation, such as the diversity of synthetic samples. |
| Researcher Affiliation | Academia | Shenghuan Sun University of California, San Francisco shenghuan.sun@ucsf.edu Gregory M. Goldgof Memorial Sloan Kettering Cancer Center goldgofg@mskcc.org Atul Butte University of California, San Francisco atul.butte@ucsf.edu Ahmed M. Alaa UC Berkeley and UCSF amalaa@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Training the reward model ... Algorithm 2 Pretraining the conditional diffusion model for generating synthetic images ... Algorithm 3 Finetuning the conditional diffusion model using pathologist feedback ... Algorithm 4 Incorporating new clinical concepts into the model (an illustrative sketch of Algorithms 1 and 3 follows this table) |
| Open Source Code | No | The paper mentions using a "public repository (https://github.com/openai/improved-diffusion.git)" for their finetuning pipeline, but this is a third-party tool they utilized, not their own open-sourced code for the specific methodology described in the paper. |
| Open Datasets | No | In all experiments, we used a dataset of hematopathologist consensus-annotated single-cell images extracted from bone marrow aspirate (BMA) whole slide images. The images were obtained from the clinical archives of an academic medical center. |
| Dataset Splits | No | Training was conducted using 128 images per cell type, with 32 images per cell type held out for testing and evaluating all performance metrics. The paper only specifies train and test splits, without explicitly defining a separate validation split or its size/counts. (A per-class split sketch follows this table.) |
| Hardware Specification | Yes | The model is trained in half-precision on 2 × 24 GB NVIDIA GPUs, with a per-GPU batch size of 16, resulting in a total batch size of 32. |
| Software Dependencies | No | The paper mentions using a public repository, the Adam optimizer [49], and the ResNeXt-50 architecture, but it does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We used a learning rate of 10⁻⁴, and an exponential moving average over parameters with a rate of 0.9999. ... The model is trained in half-precision on 2 × 24 GB NVIDIA GPUs, with a per-GPU batch size of 16, resulting in a total batch size of 32. (The optimizer/EMA setup is sketched after this table.) |
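As a reading aid for the pseudocode row, below is a minimal PyTorch sketch of the two feedback stages: Algorithm 1 (training the reward model on pathologist plausibility labels) and Algorithm 3 (reweighting the conditional denoising loss by the predicted reward). The class names, the toy CNN backbone, and the reward-weighting scheme are illustrative assumptions, not the authors' code; `diffusion` and `denoiser` stand in for GaussianDiffusion/UNet objects of the kind provided by the improved-diffusion repository the paper builds on.

```python
# Hypothetical sketch of Algorithms 1 and 3; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an (image, cell-type label) pair for clinical plausibility."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(        # toy backbone; the paper reports ResNeXt-50
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64 + num_classes, 1)

    def forward(self, x, y_onehot):
        h = torch.cat([self.features(x), y_onehot], dim=1)
        return torch.sigmoid(self.head(h)).squeeze(1)

def reward_model_step(reward_model, opt, images, labels, feedback, num_classes):
    """One supervised step on pathologist feedback (Algorithm 1):
    `feedback` is 1 if the synthetic image was judged plausible, else 0."""
    y = F.one_hot(labels, num_classes).float()
    loss = F.binary_cross_entropy(reward_model(images, y), feedback.float())
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def reward_weighted_finetune_step(diffusion, denoiser, reward_model, opt,
                                  images, labels, num_classes):
    """One reward-weighted denoising step, in the spirit of Algorithm 3.
    `diffusion` is assumed to expose `num_timesteps` and `q_sample`, and
    `denoiser` a class-conditional UNet, as in improved-diffusion."""
    b = images.size(0)
    t = torch.randint(0, diffusion.num_timesteps, (b,), device=images.device)
    noise = torch.randn_like(images)
    x_t = diffusion.q_sample(images, t, noise=noise)   # forward-noised images
    pred = denoiser(x_t, t, y=labels)                  # predicted noise
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))
    with torch.no_grad():                              # reward is a fixed weight
        w = reward_model(images, F.one_hot(labels, num_classes).float())
    loss = (w * per_sample).mean()                     # plausible samples weigh more
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```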
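Likewise, the dataset-splits row can be made concrete with a short, hypothetical per-class split: 128 training and 32 held-out test images per cell type, with no separate validation split. The function name, variable names, and shuffling scheme are assumptions.

```python
# Hypothetical per-cell-type split matching the reported 128/32 counts.
import random
from collections import defaultdict

def split_per_cell_type(items, n_train=128, n_test=32, seed=0):
    """`items` is a list of (image_path, cell_type) pairs."""
    by_type = defaultdict(list)
    for path, cell_type in items:
        by_type[cell_type].append(path)
    rng = random.Random(seed)
    train, test = [], []
    for cell_type, paths in by_type.items():
        rng.shuffle(paths)
        assert len(paths) >= n_train + n_test, f"not enough images for {cell_type}"
        train += [(p, cell_type) for p in paths[:n_train]]
        test += [(p, cell_type) for p in paths[n_train:n_train + n_test]]
    return train, test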
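Finally, a hedged sketch of the reported optimization setup: Adam at a learning rate of 10⁻⁴, an exponential moving average over parameters with rate 0.9999, and half-precision training. The helper names and the use of `torch.cuda.amp` for mixed precision are assumptions; the paper's own pipeline builds on the improved-diffusion training loop.

```python
# Hypothetical sketch of the reported optimizer / EMA / half-precision setup.
import copy
import torch

def make_training_state(model: torch.nn.Module):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # reported learning rate
    ema_model = copy.deepcopy(model)                      # frozen EMA copy for eval
    for p in ema_model.parameters():
        p.requires_grad_(False)
    scaler = torch.cuda.amp.GradScaler()                  # half-precision loss scaling
    return opt, ema_model, scaler

@torch.no_grad()
def ema_update(model, ema_model, rate=0.9999):            # reported EMA rate
    """ema <- rate * ema + (1 - rate) * current parameters."""
    for p, ema_p in zip(model.parameters(), ema_model.parameters()):
        ema_p.mul_(rate).add_(p, alpha=1.0 - rate)
```

With a per-GPU batch size of 16 on each of the two 24 GB GPUs, one optimizer step corresponds to the reported total batch of 32; `ema_update` would be called once per step.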