ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
Authors: Pengzhen Ren, Changlin Li, Hang Xu, Yi Zhu, Guangrun Wang, Jianzhuang Liu, Xiaojun Chang, Xiaodan Liang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that ViewCo outperforms state-of-the-art methods on average by up to 2.9%, 1.6%, and 2.4% mIoU on PASCAL VOC2012, PASCAL Context, and COCO, respectively. |
| Researcher Affiliation | Collaboration | Pengzhen Ren (1), Changlin Li (2), Hang Xu (3), Yi Zhu (3), Guangrun Wang (4), Jianzhuang Liu (3), Xiaojun Chang (2), Xiaodan Liang (1,5) — (1) Sun Yat-sen University, (2) ReLER, AAII, University of Technology Sydney, (3) Huawei Noah's Ark Lab, (4) University of Oxford, (5) MBZUAI |
| Pseudocode | No | The paper describes algorithms and formulations but does not include a formally labeled "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | Code release: https://github.com/pzhren/ViewCo |
| Open Datasets | Yes | In the training phase, we use CC12M (Changpinyo et al. (2021)) and the filtered YFCC (Thomee et al. (2016)) as training datasets, which contain 12M and 14M image-text pairs, respectively. |
| Dataset Splits | Yes | We evaluate ViewCo on the task of zero-shot transfer to semantic segmentation on the validation sets of PASCAL VOC 2012 (Everingham et al. (2010)), PASCAL Context (Mottaghi et al. (2014)) and COCO Stuff (Lin et al. (2014)) datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions a deep learning framework, MindSpore, and specific algorithms/models like GroupViT, ViT-S, Transformer, InfoNCE, Adam, and SGDR, but it does not specify version numbers for any software libraries or dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The input image size is 224×224, the patch size is 16×16, and the hidden dimensionality is 384. The 2-stage GroupViT finally outputs 8 segment tokens (i.e., K = 8). ViewCo's text encoder ET consists of 12 Transformer layers with a hidden feature dimensionality of 256. The thresholds on PASCAL VOC 2012, PASCAL Context, and COCO are set to 0.95, 0.35, and 0.95, respectively. We resize each input image to have a shorter side of 448. We update the parameters of ft using the exponential moving average (EMA) (He et al. (2020b)) of the parameters of fs. For example, let θs^i and θt^i be the parameters of fs and ft at training step i, respectively; then θt^i is updated as θt^i = α·θt^(i−1) + (1 − α)·θs^i, where α is a hyper-parameter for smoothing the update. |
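The EMA teacher update quoted in the experiment setup can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name, the plain-list parameter representation, and the example α value are assumptions, since the paper does not report its α.

```python
def ema_update(teacher_params, student_params, alpha):
    """EMA step: theta_t^i = alpha * theta_t^(i-1) + (1 - alpha) * theta_s^i.

    teacher_params / student_params are parallel lists of parameter values
    (a stand-in for the real model's parameter tensors).
    """
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

# Illustrative usage: after each training step, the teacher f_t
# slowly tracks the student f_s (alpha = 0.5 chosen only for the demo).
teacher = [1.0, 2.0]
student = [3.0, 4.0]
teacher = ema_update(teacher, student, alpha=0.5)
# teacher is now [2.0, 3.0]
```

In practice α is set close to 1 so the teacher changes slowly and provides a stable target for the student.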