Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation

Authors: Fenglin Liu, Chenyu You, Xian Wu, Shen Ge, Sheng Wang, Xu Sun

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that the unsupervised KGAE generates desirable medical reports without using any image-report training pairs.
Researcher Affiliation | Collaboration | 1. MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University; 2. School of ECE, Peking University; 3. Paul G. Allen School of Computer Science and Engineering, University of Washington; 4. Department of Electrical Engineering, Yale University; 5. Tencent
Pseudocode | No | The paper describes algorithms using equations and textual explanations but does not include formal pseudocode blocks or algorithm figures.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See our supplemental material.
Open Datasets | Yes | In detail, to train our knowledge-driven encoder KE_I and KE_R (see Eq. (4)), we feed G_I and G_R into a common multi-label classification network [54, 15] trained with a binary cross-entropy loss for classification of 14 common radiographic observations. In this way, our encoder can extract the knowledge representations G_I and G_R of both image and report in a common latent space, effectively bridging the vision and language domains. To train the knowledge-driven decoder, as well as the knowledge bank B (see Eq. (5)), since there are no coupled image-report pairs, we propose to reconstruct the report R from G_R. Therefore, through Eq. (5), taking the input report R = {r_1, r_2, ..., r_T} as the ground-truth report, we can train our approach by minimizing the cross-entropy loss $\mathcal{L}_{CE} = -\sum_{t=1}^{T} \log p(r_t \mid r_{1:t-1})$ (Eq. (8)). In this way, we can train our decoder in the R → G_R → R auto-encoding pipeline. During testing, we first adopt the knowledge-driven encoder to extract the knowledge representations G_I of the test image (see Eq. (4)), then feed G_I directly into the decoder to generate the final report in the I → G_I → R pipeline (see Eq. (5)). In this way, our approach can relax the reliance on image-report pairs. In our following experiments, we validate the effectiveness of our approach, which even outperforms some supervised approaches.
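To make the quoted auto-encoding objective concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of one training step of the R → G_R → R pipeline with the teacher-forced cross-entropy loss of Eq. (8); the names `knowledge_encoder` and `knowledge_decoder` are hypothetical stand-ins for the paper's knowledge-driven encoder and decoder.

```python
# Minimal sketch of the R -> G_R -> R auto-encoding step (Eq. (8)).
# `knowledge_encoder` and `knowledge_decoder` are placeholders for the
# paper's knowledge-driven encoder (Eq. (4)) and decoder (Eq. (5)).
import torch
import torch.nn.functional as F

def autoencode_report_step(report_tokens, knowledge_encoder, knowledge_decoder, optimizer):
    """report_tokens: LongTensor of shape (batch, T) holding the ground-truth report R."""
    # R -> G_R: knowledge representations of the input report
    g_r = knowledge_encoder(report_tokens)

    # G_R -> R: decode with teacher forcing; logits: (batch, T-1, vocab_size)
    logits = knowledge_decoder(g_r, report_tokens[:, :-1])

    # Eq. (8): minimize -sum_t log p(r_t | r_{1:t-1})
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        report_tokens[:, 1:].reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time the same decoder would instead be fed the knowledge representations G_I extracted from the image, i.e., the I → G_I → R pipeline described in the quote.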
Dataset Splits | Yes | Following [25], our method also focuses on the findings section as it is the most important component of reports. Then, following [16, 22, 25, 6], we randomly select 70%-10%-20% of the image-report pairs to form the training-validation-testing sets. The MIMIC-CXR [17] dataset includes 377,110 chest X-ray images associated with 227,835 reports. It is officially split into 368,960 images (222,758 reports) for training, 2,991 images (1,808 reports) for validation, and 5,159 images (3,269 reports) for testing.
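For the randomly selected 70%-10%-20% split quoted above, a simple partition might look like the sketch below; `pairs` is assumed to be a list of (image, report) tuples, and the helper name and seed are ours.

```python
# Sketch of a 70%/10%/20% random split of image-report pairs.
import random

def split_pairs(pairs, seed=0):
    random.seed(seed)
    pairs = pairs[:]                      # copy so the caller's list is untouched
    random.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]        # remaining ~20%
    return train, val, test
```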
Hardware Specification | Yes | All re-implementations and our experiments were run on 8 V100 GPUs.
Software Dependencies | No | The paper mentions software components like ResNet-50, Transformer, and Adam optimizer, but does not specify their version numbers or the versions of underlying libraries/frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The hidden size d is set to 256. For the attention mechanism in Eq. (4) and Eq. (5), we adopt multi-head attention [38]; the number of heads n is set to 8. The intermediate dimension in F, i.e., Eq. (4), is set to 1024. Based on the average performance on the validation set, N_B in the knowledge bank, i.e., Eq. (6), is set to 10,000. N_KG in the knowledge graph is set to 200. In our knowledge-driven encoder, the image embedding module adopts a ResNet-50 [12] pretrained on ImageNet [10] and fine-tuned on the CheXpert dataset [14] to extract image embeddings of shape 7×7×2048, which are projected to d = 256, yielding $I \in \mathbb{R}^{49 \times 256}$, i.e., N_I = 49. The report embedding module is implemented with the Transformer [38] equipped with the self-attention mechanism provided in Lin et al. [27]; N_R = N_I = 49. In both unsupervised and (semi-)supervised settings (see Section 3.3), the batch size is set to 16 and the Adam optimizer [18] with a learning rate of 1e-4 is used for parameter optimization.
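For convenience, the hyper-parameters quoted above can be gathered into a single configuration object; the sketch below uses our own field names, while the values are those reported in the paper.

```python
# Illustrative configuration collecting the reported hyper-parameters;
# field names are ours, values are those quoted from the paper.
from dataclasses import dataclass

@dataclass
class KGAEConfig:
    d_model: int = 256           # hidden size d
    n_heads: int = 8             # heads in multi-head attention (Eq. (4), Eq. (5))
    ffn_dim: int = 1024          # intermediate dimension of F in Eq. (4)
    n_bank: int = 10_000         # N_B, size of the knowledge bank (Eq. (6))
    n_kg: int = 200              # N_KG, nodes in the knowledge graph
    n_image_tokens: int = 49     # N_I: 7x7 ResNet-50 grid projected to d = 256
    n_report_tokens: int = 49    # N_R = N_I
    batch_size: int = 16
    learning_rate: float = 1e-4  # Adam optimizer
```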