A Simple Image Segmentation Framework via In-Context Examples

Authors: Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, Xinlong Wang, Chunhua Shen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various segmentation tasks show the effectiveness of the proposed method. Our code is released at: https://github.com/aim-uofa/SINE
Researcher Affiliation | Collaboration | Yang Liu¹, Chenchen Jing¹, Hengtao Li¹, Muzhi Zhu¹, Hao Chen¹, Xinlong Wang³, Chunhua Shen¹,²; ¹Zhejiang University, China; ²Ant Group; ³Beijing Academy of Artificial Intelligence
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at: https://github.com/aim-uofa/SINE
Open Datasets | Yes | Training Data We train our model with a diverse set of segmentation datasets, including semantic, instance, and panoptic segmentation. Specifically, we utilize three visual perception datasets: ADE20K [65] is a popular semantic segmentation dataset... COCO [31] is a widely-used dataset... Objects365 [51] is a large-scale high-quality object detection dataset.
Dataset Splits | Yes | ADE20K [65] is a popular semantic segmentation dataset... It has 25K images, including 20K for training, 2K for validation, and 3K for testing.
Hardware Specification | Yes | Our model is trained for 5 days by using 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software like DINOv2 and Adam optimizer, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We train SINE about 50K steps with 64 batch sizes. We use Adam [36] optimizer and employ β1 = 0.9, β2 = 0.999 for optimization. We use a linear learning rate scheduler with a base learning rate of 1e-4 and a warmup of 100 steps. The weight decay is set to 0.05. For data augmentation, we use random horizontal flipping and the large-scale jittering (LSJ) [13] augmentation with a random scale sampled from range 0.1 to 2.0 followed by a fixed size crop to 896 × 896.
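
The schedule and augmentation quoted in the Experiment Setup row can be illustrated with a small sketch. This is not taken from the SINE repository. The function names are hypothetical, the linear decay to zero after warmup is an assumption (the paper only says "linear learning rate scheduler"), and the LSJ geometry follows one commonly used resize-then-crop formulation that may differ from what the authors actually run.

```python
import random

# Values quoted in the Experiment Setup row.
BASE_LR = 1e-4
WARMUP_STEPS = 100
TOTAL_STEPS = 50_000        # "about 50K steps"
CROP_SIZE = 896             # fixed-size crop, 896 x 896
SCALE_RANGE = (0.1, 2.0)    # LSJ random scale range


def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay.

    Assumption: the rate decays linearly to zero at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


def sample_lsj_geometry(height: int, width: int, rng: random.Random):
    """Sample an LSJ scale, the resized image size, and a crop offset.

    Assumption: the scale is applied to the crop size, and the image is
    resized (aspect ratio kept) so its longer side matches that target.
    Images smaller than the crop would be padded, which is not shown here.
    """
    scale = rng.uniform(*SCALE_RANGE)
    target = scale * CROP_SIZE
    ratio = min(target / height, target / width)
    new_h = max(1, round(height * ratio))
    new_w = max(1, round(width * ratio))
    # Random top-left corner of the crop window; 0 when the side fits.
    top = rng.randint(0, max(new_h - CROP_SIZE, 0))
    left = rng.randint(0, max(new_w - CROP_SIZE, 0))
    return scale, (new_h, new_w), (top, left)
```

Under these assumptions the learning rate reaches 1e-4 at the end of warmup and falls to zero at step 50,000, and `sample_lsj_geometry(480, 640, random.Random(0))` returns a scale in [0.1, 2.0] with a resized long side of roughly `scale * 896` pixels.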