LGDN: Language-Guided Denoising Network for Video-Language Modeling
Authors: Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five public datasets show that our LGDN outperforms the state of the art by large margins. We also provide a detailed ablation study to reveal the critical importance of solving the noise issue, in the hope of inspiring future video-language work. (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | (1) Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; (2) Beijing Key Laboratory of Big Data Management and Analysis Methods; (3) The University of Hong Kong, Pokfulam, Hong Kong; (4) JD Corporation, Beijing, China |
| Pseudocode | No | The paper describes the model architecture and mathematical formulations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper explicitly states in its ethics review section that code is not included: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]'. |
| Open Datasets | Yes | Pre-Training Datasets. Due to the restricted computing resources, we follow COTS [32] to pre-train our LGDN on pure image-text datasets. Our pre-training data consist of Conceptual Captions [42], SBU [39], VG [23], and MSCOCO [28], which contain 5.2 million image-text pairs. We additionally apply CC12M [3] (about 2 million URLs are now invalid) for better performance, which accumulates 15.2 million image-text pairs in total. Downstream Datasets. We evaluate our proposed LGDN on four public video-text retrieval datasets: MSR-VTT [50], MSVD [4], DiDeMo [16], and VATEX [46]. To further demonstrate the general applicability of our LGDN, we also carry out experiments on a public video-question answering dataset: MSRVTT-QA [49]. |
| Dataset Splits | No | The paper mentions evaluating on specific test sets of public datasets (e.g., 'MSR-VTT 1k-A test set') and refers to supplementary material for dataset details, but it does not explicitly describe the training/validation/test splits within the main paper. |
| Hardware Specification | No | The paper states in its ethics review that hardware details are included in the Appendix, but the Appendix content is not provided in the main paper text. The main text itself does not specify exact GPU/CPU models or other specific hardware components used for experiments. |
| Software Dependencies | No | The paper mentions using 'BERT-Base' and 'ViT-Base' as encoders and 'AdamW' as an optimizer, but does not provide specific software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow, along with their versions). |
| Experiment Setup | Yes | We empirically set the initial learning rate to 1e-5 and adopt AdamW [31] with a weight decay of 0.02 for 5 epochs. In the warm-up stage (first epoch), the model is trained to optimize Eq. (10) without applying the SFP mechanism. We also set the other hyper-parameters uniformly as: salient frame number N_salient = 2, mini-batch size |B| = 24, momentum hyper-parameter m = 0.99, temperature τ = 0.07, and queue size N_m = 9,600. (A minimal configuration sketch based on these values follows the table.) |
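
To make the reported training setup easier to reuse, here is a minimal sketch assuming a PyTorch implementation (the paper does not name its framework or release code); the model stand-in and variable names are hypothetical, while the hyper-parameter values are taken from the "Experiment Setup" row above.

```python
# Hedged sketch only: hyper-parameter values come from the paper's Section 4;
# the model below is a placeholder, not the actual LGDN architecture.
import torch

model = torch.nn.Linear(768, 256)  # stand-in for the real LGDN encoders

config = dict(
    lr=1e-5,            # initial learning rate
    weight_decay=0.02,  # AdamW weight decay
    epochs=5,           # total training epochs
    warmup_epochs=1,    # first epoch: optimize Eq. (10) without SFP
    n_salient=2,        # salient frame number N_salient
    batch_size=24,      # mini-batch size |B|
    momentum=0.99,      # momentum encoder coefficient m
    temperature=0.07,   # contrastive temperature tau
    queue_size=9600,    # momentum queue size N_m
)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=config["lr"], weight_decay=config["weight_decay"]
)

for epoch in range(config["epochs"]):
    use_sfp = epoch >= config["warmup_epochs"]  # enable SFP after warm-up
    # ... per-batch forward/backward passes with the LGDN losses would go here ...
```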