Multi-granularity Correspondence Learning from Long-term Noisy Videos

Authors: Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on video retrieval, video QA, and action segmentation verify the effectiveness of our method.
Researcher Affiliation | Academia | 1 Sichuan University; 2 Beijing University of Posts and Telecommunications; 3 Dalian University of Technology
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://lin-yijie.github.io/projects/Norton.
Open Datasets | Yes | We conduct the evaluation on YouCookII (Zhou et al., 2018)... MSR-VTT (Xu et al., 2016) is a well-known retrieval benchmark... We use the long video dataset COIN (Tang et al., 2019)... we use the instructional videos HowTo100M (Miech et al., 2019) for pre-training.
Dataset Splits | No | The paper mentions standard benchmarks and a specific test-set size for MSR-VTT ("1,000 clip-caption test pairs"), and states that pre-training follows the sampling strategy of VideoCLIP. However, it does not explicitly provide the percentages or sample counts of the training, validation, and test splits for all datasets used, which is necessary to reproduce the data partitioning.
Hardware Specification | Yes | We implement our method in PyTorch 1.11.0 (Paszke et al., 2019) and conduct all experiments on the Red Hat 6.4.0-1 OS. We train the network for 10 epochs with fp16 precision, which takes approximately 1 A100 GPU day.
Software Dependencies | Yes | We implement our method in PyTorch 1.11.0 (Paszke et al., 2019) and conduct all experiments on the Red Hat 6.4.0-1 OS.
Experiment Setup | Yes | We train the network for 10 epochs with fp16 precision, which takes approximately 1 A100 GPU day. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-5 to optimize the network. Each training batch consisted of 64 videos, each paired with 16 corresponding clips and captions. We set the balanced weight λ between clip and video loss to 0.1. The log-sum-exp parameter α and the faulty negative exploitation β are set to 1 and 0.3, respectively. We run 50 steps of the Sinkhorn algorithm and set the entropy ε to 0.1 and 1 for calculating the optimal transport in L_video and L_clip, respectively. (Illustrative sketches of this setup follow the table.)
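
For concreteness, the hyperparameters quoted in the Experiment Setup row can be collected as below. This is a minimal sketch, not the authors' released code: the dictionary keys and variable names are our own, and the `torch.nn.Linear` module is a placeholder since the excerpt does not describe the Norton network itself.

```python
import torch

# Hyperparameters transcribed from the Experiment Setup row above;
# key names and layout are illustrative choices, not the paper's.
config = {
    "epochs": 10,
    "precision": "fp16",
    "learning_rate": 1e-5,
    "videos_per_batch": 64,
    "clips_per_video": 16,   # each video paired with 16 clips and captions
    "lambda_video": 0.1,     # balanced weight between clip and video losses
    "alpha": 1.0,            # log-sum-exp parameter
    "beta": 0.3,             # faulty negative exploitation
    "sinkhorn_steps": 50,
    "eps_video": 0.1,        # OT entropy for the video-level loss
    "eps_clip": 1.0,         # OT entropy for the clip-level loss
}

# Placeholder module standing in for the Norton network; only the
# Adam optimizer and learning rate are taken from the paper.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```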
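
The Sinkhorn algorithm named in that row is a standard entropic optimal-transport solver; a generic PyTorch sketch follows, assuming uniform marginals and a similarity matrix `sim` between clips and captions. The function name and tensor shapes are our assumptions, and the authors' implementation may differ in cost construction, marginals, or numerical stabilization.

```python
import torch

def sinkhorn(sim, eps=0.1, n_iters=50):
    """Entropic optimal transport via Sinkhorn iterations (generic sketch).

    sim:     (n, m) similarity matrix between clips and captions.
    eps:     entropy regularization; the paper quotes 0.1 for L_video
             and 1.0 for L_clip.
    n_iters: number of Sinkhorn steps; the paper quotes 50.
    Returns a transport plan of the same shape whose row/column sums
    approximately match the uniform marginals.
    """
    n, m = sim.shape
    # Gibbs kernel: higher similarity receives exponentially more mass.
    K = torch.exp(sim / eps)
    # Uniform marginals over clips and captions.
    r = torch.full((n,), 1.0 / n, device=sim.device)
    c = torch.full((m,), 1.0 / m, device=sim.device)
    u = torch.ones_like(r)
    # Alternately rescale rows and columns to match the marginals.
    for _ in range(n_iters):
        v = c / (K.t() @ u)
        u = r / (K @ v)
    # Plan = diag(u) K diag(v).
    return u[:, None] * K * v[None, :]

# Example: align 16 clips with 16 captions at the clip-level entropy of 1.0.
plan = sinkhorn(torch.randn(16, 16), eps=1.0, n_iters=50)
```

Under the quoted settings, the same routine would be run with eps=0.1 for the video-level loss and eps=1.0 for the clip-level loss, 50 steps in both cases.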