Completing Visual Objects via Bridging Generation and Segmentation

Authors: Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments for analysis and comparison, the results of which indicate the strength and robustness of MaskComp against previous methods, e.g., Stable Diffusion.
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Microsoft, 3 MBZUAI.
Pseudocode | No | The paper includes diagrams and descriptive text for its processes but no formally structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code will be made publicly available.
Open Datasets | Yes | We evaluate MaskComp on two popular datasets: AHP (Zhou et al., 2021a) and DYCE (Ehsani et al., 2018). For both datasets, the non-occluded object and its corresponding mask are available for each object. We train MaskComp on AHP and on a filtered subset of Open Images V6 (Kuznetsova et al., 2020).
Dataset Splits | No | The paper mentions using specific datasets for training and evaluation but does not provide explicit percentages or counts for training, validation, and test splits.
Hardware Specification | Yes | Table 3c reports the inference time of each component in IMD (with a single NVIDIA V100 GPU).
Software Dependencies | No | The paper mentions software components such as Stable Diffusion, Adam, BLIP, SAM, ControlNet, and Swin-Transformer, but generally lacks the specific version numbers needed to reproduce the environment.
Experiment Setup | Yes | For the generation stage, we train the CompNet with frozen Stable Diffusion (Rombach et al., 2022) on the AHP dataset for 50 epochs. The learning rate is set to 1e-5. We adopt a batch size of 8 and an Adam (Loshchilov & Hutter, 2017) optimizer. The image is resized to 512 x 512 for both training and inference. The object is cropped and resized so that its longest side is 360 before being pasted onto the image. For a more generalized setting, we train the CompNet on a subset of the Open Images (Kuznetsova et al., 2020) dataset for 36 epochs. ... We vote masks with a threshold of τ = 0.5. During inference, unless otherwise specified, we run the IMD process for 5 steps with N = 5 images at each step.
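
The quoted setup implies a simple inference recipe: at each IMD step, generate N = 5 completions conditioned on the current mask, segment each one, and fuse the resulting masks by pixel-wise voting with threshold τ = 0.5. Below is a minimal sketch of that loop, assuming binary masks as NumPy arrays; the function names (`vote_masks`, `imd_loop`) and the `generate_fn`/`segment_fn` callables are hypothetical placeholders for illustration, not the authors' released implementation.

```python
import numpy as np

def vote_masks(masks, tau=0.5):
    """Fuse N candidate binary masks by pixel-wise voting.

    masks: sequence of (H, W) arrays with values in {0, 1}.
    tau:   voting threshold; a pixel is kept if at least a fraction
           tau of the candidate masks marks it as foreground.
    """
    stack = np.stack(masks).astype(np.float32)   # (N, H, W)
    agreement = stack.mean(axis=0)               # per-pixel vote fraction
    return (agreement >= tau).astype(np.uint8)   # fused mask

def imd_loop(generate_fn, segment_fn, image, init_mask,
             steps=5, n_images=5, tau=0.5):
    """Hypothetical IMD loop: generate, segment, and vote at each step."""
    mask = init_mask
    for _ in range(steps):
        candidates = [segment_fn(generate_fn(image, mask))
                      for _ in range(n_images)]
        mask = vote_masks(candidates, tau=tau)
    return mask
```

In this reading, the 5-step / N = 5 setting quoted above corresponds to `steps=5, n_images=5`, with `generate_fn` standing in for the mask-conditioned generation stage and `segment_fn` for the segmentation stage.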