LGDN: Language-Guided Denoising Network for Video-Language Modeling

Authors: Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'Extensive experiments on five public datasets show that our LGDN outperforms the state-of-the-arts by large margins. We also provide detailed ablation study to reveal the critical importance of solving the noise issue, in hope of inspiring future video-language work.' (Section 4: Experiments)
Researcher Affiliation | Collaboration | (1) Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; (2) Beijing Key Laboratory of Big Data Management and Analysis Methods; (3) The University of Hong Kong, Pokfulam, Hong Kong; (4) JD Corporation, Beijing, China
Pseudocode | No | The paper describes the model architecture and mathematical formulations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper explicitly states in its reproducibility checklist that code is not included: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]'.
Open Datasets | Yes | 'Pre-Training Datasets. Due to the restricted computing resources, we follow COTS [32] to pre-train our LGDN on the pure image-text datasets. Our pre-training datasets consist of Conceptual Captions [42], SBU [39], VG [23], and MSCOCO [28], which contain 5.2 million image-text pairs. We additionally apply CC12M [3] (about 2 million URLs are now invalid) for better performance, which accumulates 15.2 million image-text pairs in total. Downstream Datasets. We evaluate our proposed LGDN on four public video-text retrieval datasets: MSR-VTT [50], MSVD [4], DiDeMo [16], and VATEX [46]. To further demonstrate the general applicability of our LGDN, we also carry out experiments on a public video-question-answering dataset: MSRVTT-QA [49].'
Dataset Splits | No | The paper mentions evaluating on specific test sets of public datasets (e.g., the 'MSR-VTT 1k-A test set') and refers to the supplementary material for dataset details, but it does not explicitly describe the training/validation/test splits within the main paper.
Hardware Specification | No | The paper states in its reproducibility checklist that hardware details are included in the Appendix, but the Appendix content is not provided with the main paper text. The main text itself does not specify exact GPU/CPU models or other specific hardware components used for the experiments.
Software Dependencies | No | The paper mentions using 'BERT-Base' and 'ViT-Base' as encoders and 'AdamW' as the optimizer, but does not provide specific software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow, along with their versions).
Experiment Setup | Yes | 'We empirically set the initial learning rate to 1e-5 and adopt AdamW [31] with a weight decay of 0.02 for 5 epochs. In the warm-up stage (first epoch), the model is trained to optimize Eq. (10) without applying SFP mechanism. We also set the other hyper-parameters uniformly as: salient frame number N_salient = 2, mini-batch size |B| = 24, momentum hyper-parameter m = 0.99, temperature τ = 0.07, and queue size N_m = 9,600.'
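
Since no code is released, the sketch below shows, in PyTorch, the kind of momentum-contrast training step these hyper-parameters imply (an EMA key encoder, a memory queue of negatives, and an InfoNCE loss). Only the numeric values (learning rate 1e-5, weight decay 0.02, m = 0.99, τ = 0.07, |B| = 24, N_m = 9,600) come from the paper; the linear stand-in encoders, the embedding dimension, and the helper names `momentum_update` and `contrastive_step` are hypothetical illustrations, not LGDN's actual implementation.

```python
# Minimal momentum-contrast sketch built around the reported hyper-parameters.
# Encoder stubs and function names are hypothetical, not LGDN's implementation.
import torch
import torch.nn.functional as F

EMBED_DIM = 256        # assumed projection size (not stated in the paper)
QUEUE_SIZE = 9_600     # N_m from the paper
MOMENTUM = 0.99        # m from the paper
TEMPERATURE = 0.07     # tau from the paper

# Linear stand-ins; the paper uses ViT-Base / BERT-Base backbones.
query_encoder = torch.nn.Linear(768, EMBED_DIM)
key_encoder = torch.nn.Linear(768, EMBED_DIM)
key_encoder.load_state_dict(query_encoder.state_dict())  # start in sync
for p in key_encoder.parameters():
    p.requires_grad = False  # key encoder is updated only by momentum

# AdamW with the reported learning rate and weight decay.
optimizer = torch.optim.AdamW(
    query_encoder.parameters(), lr=1e-5, weight_decay=0.02
)

# Memory queue of normalized negative keys.
queue = F.normalize(torch.randn(QUEUE_SIZE, EMBED_DIM), dim=1)

@torch.no_grad()
def momentum_update():
    """EMA update of the key encoder: k <- m * k + (1 - m) * q."""
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.mul_(MOMENTUM).add_(q_p, alpha=1.0 - MOMENTUM)

def contrastive_step(video_feats, text_feats):
    """One InfoNCE step over a batch, using the queue as extra negatives."""
    global queue
    q = F.normalize(query_encoder(video_feats), dim=1)
    with torch.no_grad():
        momentum_update()
        k = F.normalize(key_encoder(text_feats), dim=1)
    # Positives: matching pairs in the batch; negatives: the memory queue.
    logits_pos = (q * k).sum(dim=1, keepdim=True)
    logits_neg = q @ queue.t()
    logits = torch.cat([logits_pos, logits_neg], dim=1) / TEMPERATURE
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive is column 0
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # FIFO queue update with the newest keys.
    queue = torch.cat([k.detach(), queue], dim=0)[:QUEUE_SIZE]
    return loss.item()

if __name__ == "__main__":
    batch = 24  # |B| from the paper
    print(contrastive_step(torch.randn(batch, 768), torch.randn(batch, 768)))
```

Under this reading, the first of the 5 epochs would run the same step without the salient-frame proposal (SFP) mechanism, which only affects how the frame features fed into the encoders are selected.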