PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology

Authors: Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, Lin Yang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental results of PathAsst show the potential of harnessing an AI-powered generative foundation model to improve pathology diagnosis and treatment processes.
Researcher Affiliation Academia (1) College of Computer Science and Technology, Zhejiang University, China; (2) Research Center for Industries of the Future and School of Engineering, Westlake University, China; (3) Department of Computer Science and Engineering, The Ohio State University, USA; (4) School of Computer and Computing Science, Hangzhou City University, China
Pseudocode No The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.
Open Datasets Yes We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology. CRC100K dataset (Kather, Halama, and Marx 2018): This is a collection of 100K image patches derived from H&E stained histological images of both colorectal cancer and normal tissue... WSSS4LUAD (Han et al. 2022)... LC25000 (Borkowski et al. 2019).
Dataset Splits No The paper describes the datasets used for training (PathCap, PathInstruct) and for evaluation (PathVQA, CRC100K, WSSS4LUAD, LC25000), but does not explicitly provide the specific train/validation/test splits (e.g., percentages or sample counts for each split) for its primary training datasets (PathCap and PathInstruct) needed for reproduction.
Hardware Specification No The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or memory specifications.
Software Dependencies No The paper mentions several models and frameworks used (e.g., ConvNeXt, YOLOv7, Vicuna-13B, CLIP, Stable Diffusion), but it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x, specific library versions).
Experiment Setup Yes PathAsst is trained on the PathInstruct dataset through a two-phase procedure. In the first phase, both the vision encoder and the LLM are frozen, and only the FC layer connecting the vision encoder to the LLM is trained; this initial phase preliminarily aligns the vision encoder with the LLM, using the detailed description-based part of PathInstruct. In the second phase, with the aim of having PathAsst generate higher-quality and more detailed responses, all the data from books within the PathInstruct dataset is extracted, together with PubMed samples containing a single image and captions exceeding 50 tokens, yielding a total training set of 35K samples. Only PathCLIP is frozen during this phase's training. Both forms of instruction-following data are standardized into a common format, as shown in Table 1: a predefined system message sets the context for the LLM's role, followed by a conversation in which the user provides instructions and the assistant responds accordingly. The model is finetuned via instruction tuning with next-word prediction, i.e., it is trained to maximize the likelihood of generating an accurate response given the input image $I$ and instruction $X_{\text{instruct}}$. The loss is the negative log-likelihood of the correct next token, summed across all time steps:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\left(x_t \mid I, X_{\text{instruct}}, X_{a,<t}; \theta\right), \quad (1)$$

where $X_{a,<t}$ denotes the prior tokens in the response sequence and $\theta$ denotes the trainable parameters of PathAsst: during the first phase of training, $\theta$ corresponds to the parameters of the FC layer; in the subsequent phase, it covers both the FC layer and the LLM parameters. $T$ is the length of the ground-truth response, and $p(x_t \mid I, X_{\text{instruct}}, X_{a,<t}; \theta)$ is the probability of generating the $t$-th token in the response sequence.
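The standardized instruction-following format the excerpt describes (system message followed by user/assistant turns, cf. Table 1 of the paper) might look roughly like the sketch below. The field names, system prompt, and `<image>` placeholder are illustrative assumptions, not the paper's released schema.

```python
# Hypothetical instruction-following sample, loosely following the paper's
# description of Table 1 (system message + user/assistant conversation).
sample = {
    "image": "patch_0001.png",          # pathology image I
    "conversations": [
        {"role": "system",              # predefined context for the LLM's role
         "content": "You are PathAsst, an AI assistant specialized in pathology."},
        {"role": "user",                # instruction X_instruct
         "content": "<image>\nDescribe the tissue shown in this H&E patch."},
        {"role": "assistant",           # target response X_a, supervised via Eq. (1)
         "content": "The patch shows colorectal adenocarcinoma with ..."},
    ],
}
```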
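For concreteness, the two-phase freezing scheme and the loss in Eq. (1) can be sketched in a few lines of PyTorch. This is a minimal illustration under several assumptions (a HuggingFace-style causal LM that accepts `inputs_embeds` and `labels`, a vision encoder returning per-patch features, and embedding dimensions chosen to suggest Vicuna-13B); the class and helper names are hypothetical, not the authors' released code.

```python
import torch
import torch.nn as nn

IGNORE_INDEX = -100  # label value ignored by the LM's cross-entropy loss


class PathAsstSketch(nn.Module):
    """Minimal sketch: frozen vision tower + trainable FC projector + causal LM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., the PathCLIP image tower (assumed interface)
        self.llm = llm                        # e.g., a Vicuna-13B causal LM (assumed interface)
        # FC layer aligning vision features with the LLM embedding space;
        # in phase 1 this is the only module with trainable parameters.
        self.fc = nn.Linear(vision_dim, llm_dim)
        for p in self.vision_encoder.parameters():
            p.requires_grad = False           # frozen in both phases
        for p in self.llm.parameters():
            p.requires_grad = False           # re-enabled for phase 2

    def forward(self, image, input_ids, labels):
        # Project patch features into the LLM space and prepend them
        # to the text-token embeddings.
        vis = self.fc(self.vision_encoder(image))          # (B, N, llm_dim)
        txt = self.llm.get_input_embeddings()(input_ids)   # (B, T, llm_dim)
        inputs_embeds = torch.cat([vis, txt], dim=1)
        # Image positions carry no next-token supervision.
        vis_labels = torch.full(vis.shape[:2], IGNORE_INDEX,
                                dtype=labels.dtype, device=labels.device)
        labels = torch.cat([vis_labels, labels], dim=1)
        # A HuggingFace-style causal LM shifts the labels internally and
        # returns the negative log-likelihood of Eq. (1), averaged over
        # the unmasked tokens.
        return self.llm(inputs_embeds=inputs_embeds, labels=labels).loss


def mask_instruction_tokens(input_ids, response_start):
    """Only response tokens x_t contribute to Eq. (1); the system message
    and user instruction X_instruct are masked with IGNORE_INDEX."""
    labels = input_ids.clone()
    labels[:, :response_start] = IGNORE_INDEX
    return labels
```

Switching from phase 1 to phase 2 then amounts to re-enabling `requires_grad` on the LLM parameters while keeping the vision encoder (PathCLIP) frozen, matching the paper's description of which parameters $\theta$ covers in each phase.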