Text Promptable Surgical Instrument Segmentation with Vision-Language Models

Authors: Zijian Zhou, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Miaojing Shi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability. To our knowledge, this is the first implementation of a promptable approach to surgical instrument segmentation, offering significant potential for practical application in the field of robotic-assisted surgery. Code is available at https://github.com/franciszzj/TP-SIS.
Researcher Affiliation | Academia | 1 Department of Informatics, King's College London; 2 School of Biomedical Engineering & Imaging Sciences, King's College London; 3 College of Electronic and Information Engineering, Tongji University. {first_name}.{last_name}@kcl.ac.uk; mshi@tongji.edu.cn
Pseudocode | No | The paper describes the method using text and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/franciszzj/TP-SIS.
Open Datasets | Yes | We evaluate our method on two endoscopic surgical instrument segmentation datasets: EndoVis2017 [1], EndoVis2018 [2].
Dataset Splits | Yes | For EndoVis2017, we employ 4-fold cross-validation to assess the model performance. For EndoVis2018, we adopt the widely used labeling and dataset partitioning method proposed in [12]. This dataset consists of 15 video sequences, with 11 training and 4 testing sequences, and 7 predefined instrument categories (bipolar forceps, prograsp forceps, large needle driver, monopolar curved scissors, ultrasound probe, suction instrument, clip applier). Additionally, both datasets provide binary and parts segmentation labels. Binary segmentation comprises background tissue and instruments, while parts segmentation distinguishes instrument components as shaft, wrist, and claspers. Besides the two datasets, we have also evaluated our method on the EndoVis2019 [39] and CholecSeg8k [17] datasets in the supplementary material. (A dataset-split sketch appears after the table.)
Hardware Specification | Yes | The model is trained on 4 V100 GPUs; the batch size is 16.
Software Dependencies | No | The paper mentions software like the 'CLIP model [19]' and 'Adam [23] optimizer', but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, specific CLIP library versions).
Experiment Setup | Yes | We offer two training/inference default image sizes, 896×896 and 448×448, which are compatible with the size requirements for image patching in the ViT-based image encoder [8]. During evaluation, we restore the segmentation prediction to the original image size. Feature dimension D is 1024. Following [46], we employ a threshold θ = 0.35 to transform the score map S into mask M, and select the highest-scoring category per pixel in multi-category cases. For hard instrument area reinforcement, we set mask ratio r to 0.25. Finally, the loss weight λ is 0.5. All hyperparameters are determined empirically by segregating 20% of the training data as a validation set following [46]. Training: we adopt the Adam [23] optimizer with a learning rate of 1e-4, train for 50 epochs, and reduce the learning rate to 1e-5 at the 35th epoch. To enhance the model's generalization, we apply data augmentation techniques to the image, including random crop, horizontal flip, random rotation, and brightness perturbation. The model is trained on 4 V100 GPUs; the batch size is 16. (See the score-to-mask and training-configuration sketches after the table.)
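
The Dataset Splits row can be summarized as a minimal Python sketch. The sequence IDs, the helper names make_endovis2018_split and make_endovis2017_folds, and the round-robin fold assignment are assumptions for illustration; only the 15-sequence / 11-train / 4-test counts, the 4-fold cross-validation, and the 7 instrument categories come from the report.

```python
# Sketch of the reported dataset partitioning. Concrete sequence IDs are
# placeholders, not taken from the paper.
ENDOVIS2018_CLASSES = [
    "bipolar forceps", "prograsp forceps", "large needle driver",
    "monopolar curved scissors", "ultrasound probe",
    "suction instrument", "clip applier",
]

def make_endovis2018_split(sequence_ids):
    """Partition the 15 EndoVis2018 sequences into 11 training / 4 test sequences.

    Which 4 sequences form the test set follows [12] in the paper; the slice
    below is only a placeholder for that assignment.
    """
    assert len(sequence_ids) == 15
    return {"train": sequence_ids[:11], "test": sequence_ids[11:]}

def make_endovis2017_folds(sequence_ids, k=4):
    """Round-robin k-fold cross-validation split (k = 4 for EndoVis2017)."""
    folds = [sequence_ids[i::k] for i in range(k)]
    return [
        {"val": list(fold), "train": [s for s in sequence_ids if s not in fold]}
        for fold in folds
    ]
```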
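
The score-map thresholding described in the Experiment Setup row (θ = 0.35, then the highest-scoring category per pixel) could look like the following PyTorch snippet. This is an illustrative sketch, not the authors' implementation, and the function name scores_to_mask is hypothetical.

```python
import torch

def scores_to_mask(score_map: torch.Tensor, theta: float = 0.35) -> torch.Tensor:
    """Convert a (C, H, W) per-category score map into an (H, W) label map.

    Pixels whose best score falls below theta are assigned 0 (background);
    otherwise the highest-scoring category (1..C) wins.
    """
    best_score, best_cat = score_map.max(dim=0)  # highest-scoring category per pixel
    labels = best_cat + 1                        # shift so that 0 can denote background
    labels[best_score < theta] = 0               # suppress low-confidence pixels
    return labels

# Example with a dummy 7-category score map (EndoVis2018 defines 7 instrument classes).
mask = scores_to_mask(torch.rand(7, 448, 448))
```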
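
A minimal PyTorch-style sketch of the training configuration from the Experiment Setup row. The placeholder model, the torchvision augmentation parameters (crop scale, rotation range, brightness strength), and the MultiStepLR scheduler are assumptions; the learning-rate schedule (1e-4 reduced to 1e-5 at epoch 35 of 50), batch size 16, image sizes, and the hyperparameters D, θ, r, λ are taken from the report.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import transforms

CONFIG = {
    "image_size": 448,        # or 896; both match the ViT patching requirement
    "feature_dim": 1024,      # feature dimension D
    "score_threshold": 0.35,  # theta, used to turn the score map S into mask M
    "mask_ratio": 0.25,       # r, hard instrument area reinforcement
    "loss_weight": 0.5,       # lambda
    "batch_size": 16,         # reported across 4 V100 GPUs
    "epochs": 50,
}

# Augmentations named in the report: random crop, horizontal flip,
# random rotation, brightness perturbation (the strengths are assumptions).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(CONFIG["image_size"], scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

# Placeholder module; the actual TP-SIS model is released at
# https://github.com/franciszzj/TP-SIS.
model = torch.nn.Linear(CONFIG["feature_dim"], 7)

optimizer = Adam(model.parameters(), lr=1e-4)
# Learning rate 1e-4, reduced to 1e-5 at epoch 35 of 50.
scheduler = MultiStepLR(optimizer, milestones=[35], gamma=0.1)
```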