SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Authors: Wenxi Yue, Jing Zhang, Kun Hu, Yong Xia, Jiebo Luo, Zhiyong Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters." "We conduct extensive experiments on the challenging EndoVis2018 and EndoVis2017 datasets, achieving state-of-the-art (SOTA) performance while significantly improving training efficiency."
Researcher Affiliation | Academia | 1 School of Computer Science, The University of Sydney; 2 School of Computer Science, Northwestern Polytechnical University; 3 Department of Computer Science, University of Rochester. {wenxi.yue, jing.zhang1, kun.hu, zhiyong.wang}@sydney.edu.au, yxia@nwpu.edu.cn, jluo@cs.rochester.edu
Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include structured pseudocode or an algorithm block.
Open Source Code | Yes | "The source code is available at https://github.com/wenxi-yue/SurgicalSAM."
Open Datasets | Yes | "We use the EndoVis2018 (Allan et al. 2020) and EndoVis2017 (Allan et al. 2019) datasets and adhere to the standard protocols defined by Shvets et al. (2018) and González, Bravo-Sánchez, and Arbeláez (2020)."
Dataset Splits | Yes | "EndoVis2017 consists of eight videos, each with 255 frames, for which we perform 4-fold cross-validation following Shvets et al. (2018). EndoVis2018 offers 11 training videos and four validation videos, each consisting of 149 frames." (A sketch of the fold construction follows the table.)
Hardware Specification | Yes | "Our model is implemented using PyTorch and trained and evaluated on an Nvidia Tesla V100 16GB GPU."
Software Dependencies | No | "Our model is implemented using PyTorch and trained and evaluated on an Nvidia Tesla V100 16GB GPU." (The PyTorch version is not specified.)
Experiment Setup | Yes | "For the prototype-based prompt encoder, the intermediate dimensions r_D and r_S are both set to 128 and the number of tokens per class n is set to 2 and 4 for EndoVis2018 and EndoVis2017, respectively. For the prototype contrastive loss, a temperature τ of 0.07 is used. We employ an Adam optimiser with a learning rate of 0.001 and 0.0001 for EndoVis2018 and EndoVis2017, respectively. To reduce computational load, we adopt pre-computed image embeddings in training, employing a batch size of 32." (Hedged sketches of the contrastive loss, optimiser setup, and embedding pre-computation follow the table.)
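
The dataset-splits row quotes a 4-fold cross-validation protocol over the eight EndoVis2017 videos. Below is a minimal sketch of how such a split can be constructed; the grouping of video IDs into folds is the assignment commonly attributed to Shvets et al. (2018), but it should be treated as an illustrative assumption rather than a verified reproduction of the authors' code.

```python
# Sketch: 4-fold cross-validation over the 8 EndoVis2017 videos (255 frames each).
# The fold pairing below is an assumed, commonly cited assignment.
VIDEOS = list(range(1, 9))                 # video IDs 1..8
FOLDS = [[1, 3], [2, 5], [4, 8], [6, 7]]   # assumption: fold -> held-out videos

def make_split(fold_idx):
    """Return (train_videos, val_videos) for one cross-validation fold."""
    val_videos = FOLDS[fold_idx]
    train_videos = [v for v in VIDEOS if v not in val_videos]
    return train_videos, val_videos

for k in range(4):
    train, val = make_split(k)
    print(f"fold {k}: train={train}  val={val}")
```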
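
The experiment-setup row mentions a prototype contrastive loss with temperature τ = 0.07. The sketch below shows the standard InfoNCE-style form such a loss usually takes; the function and tensor names are ours, and the paper's exact formulation may differ in detail.

```python
# Hedged sketch of a prototype contrastive (InfoNCE-style) loss, tau = 0.07.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(embeddings, prototypes, labels, tau=0.07):
    """
    embeddings: (B, D) per-sample class embeddings
    prototypes: (C, D) one learned prototype per instrument class
    labels:     (B,)   ground-truth class indices
    """
    emb = F.normalize(embeddings, dim=-1)   # cosine-normalise both sides
    pro = F.normalize(prototypes, dim=-1)
    logits = emb @ pro.t() / tau            # (B, C) temperature-scaled similarities
    return F.cross_entropy(logits, labels)  # attract each sample to its class prototype

# Quoted optimiser setup, expressed as a config (`model` stands for the small
# set of tunable parameters; this naming is ours):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 1e-4 for EndoVis2017
```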
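
The same row notes that training uses pre-computed image embeddings with a batch size of 32. The following sketch shows the usual caching step with the public segment-anything API, so the frozen image encoder runs only once per frame; checkpoint and file paths are illustrative.

```python
# Sketch: cache SAM image embeddings to disk before training.
import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

@torch.no_grad()
def cache_embedding(image_path, out_path):
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)                   # one forward pass of the ViT-H encoder
    emb = predictor.get_image_embedding().cpu()  # (1, 256, 64, 64) feature map
    torch.save(emb, out_path)

# cache_embedding("frames/seq_1/frame000.png", "embeddings/seq_1/frame000.pt")
```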