Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

Authors: Zhu Zhang, Zhou Zhao, Zhijie Lin, Jieming Zhu, Xiuqiang He

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on five vision-language grounding datasets verify the effectiveness of our CCL paradigm.
Researcher Affiliation | Collaboration | Zhu Zhang (1,2), Zhou Zhao (1,2), Zhijie Lin (1), Jieming Zhu (3), and Xiuqiang He (3); 1: Zhejiang University; 2: Key Laboratory Foundation of Information Perception and Systems for Public Security of MIIT, Nanjing University of Science and Technology; 3: Huawei Noah's Ark Lab
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement about releasing source code or a direct link to a code repository.
Open Datasets | Yes | We conduct experiments on two large-scale video grounding datasets, Activity Caption [1] and Charades-STA [14], and three large-scale image grounding datasets, RefCOCO [47], RefCOCO+ [47], and RefCOCOg [28]. The dataset details are introduced in Section 2 of the supplementary material.
Dataset Splits | No | The paper reports results on 'Val', 'Test A', and 'Test B' partitions but does not specify exact split percentages or sample counts for the training, validation, and test sets. It implies standard splits without providing the details needed for reproduction.
Hardware Specification | No | The paper does not explicitly describe the hardware specifications (e.g., specific GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using an 'Adam optimizer' and 'pre-trained GloVe embedding' but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | As for the hyper-parameters in CCL, we set the proposal number M in P+/P− to 32 for video grounding and 12 for image grounding. In FLS and ILS, we set the number B of memory vectors to 100, where the coefficient α of the momentum update is set to 0.9. To avoid time-consuming training, the number J of RCT/DCT is set to 3, where each type of transformation strategy is only applied once and produces a counterfactual result. During MIL-based pretraining, we use an Adam optimizer [13] with the initial learning rate 0.001. We then use another Adam optimizer with the initial learning rate 0.0005 for the CCL training. ... where Δmil is a margin value which is set to 1.0. ... β is set to 0.01 to balance two losses. ... where Δrank is a margin value which is set to 0.6. ... where τ is the softmax temperature, which is set to 0.5 ... where we set λ to 0.2 for the balance of two losses.
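
For quick reference, the quoted hyper-parameters can be collected into a single configuration. The sketch below is illustrative only: the dictionary keys and the momentum-update helper are hypothetical names chosen for readability, not the authors' released code, and the update rule is the standard exponential-moving-average form implied by "coefficient α of the momentum update".

```python
# Hypothetical collection of the hyper-parameters quoted in the Experiment Setup row.
# Key names are assumptions made for this sketch, not identifiers from the paper's code.
ccl_hyperparams = {
    "num_proposals_video": 32,   # proposal number M in P+/P− for video grounding
    "num_proposals_image": 12,   # proposal number M in P+/P− for image grounding
    "num_memory_vectors": 100,   # number B of memory vectors in FLS and ILS
    "momentum_alpha": 0.9,       # coefficient α of the momentum update
    "num_transforms": 3,         # number J of RCT/DCT counterfactual transformations
    "lr_pretrain": 1e-3,         # Adam learning rate for MIL-based pretraining
    "lr_ccl": 5e-4,              # Adam learning rate for CCL training
    "margin_mil": 1.0,           # margin Δmil in the MIL loss
    "beta": 0.01,                # balance coefficient β between the two pretraining losses
    "margin_rank": 0.6,          # margin Δrank in the ranking loss
    "temperature": 0.5,          # softmax temperature τ in the contrastive objective
    "lambda": 0.2,               # balance coefficient λ between the two CCL losses
}


def momentum_update(memory: float, feature: float,
                    alpha: float = ccl_hyperparams["momentum_alpha"]) -> float:
    """Assumed form of the memory-vector update: an exponential moving average
    of incoming features with coefficient alpha (here 0.9)."""
    return alpha * memory + (1.0 - alpha) * feature
```

This is only a convenience summary of the values the paper states; the actual training code, loss definitions, and memory-bank implementation are not released, which is why the Open Source Code row above is marked No.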