End-to-end Multi-modal Video Temporal Grounding

Authors: Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
Researcher Affiliation | Collaboration | 1 University of California, Merced; 2 Phiar; 3 Yonsei University; 4 Google Research
Pseudocode | No | The paper describes the proposed framework using text and diagrams (Figures 1 and 2) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and models are available at https://github.com/wenz116/DRFT.
Open Datasets | Yes | We conduct extensive experiments on the Charades-STA [10] and ActivityNet Captions [17] datasets
Dataset Splits | Yes | ActivityNet Captions. ... The dataset is split into training, validation and testing sets with a ratio of 2:1:1, resulting in 37,421, 17,505 and 17,031 video-query pairs, respectively.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions implementing the model in PyTorch and training with the Adam optimizer, but does not provide version numbers for PyTorch or any other software libraries/dependencies.
Experiment Setup | Yes | The feature dimension c is set to 512. In the contrastive loss (1), the temperature parameter τ is set to 0.1. The projection head h(·) is a 2-layer MLP that projects the feature into a 512-dimensional latent space. We implement the proposed model in PyTorch with the Adam optimizer and a fixed learning rate of 4 × 10⁻⁴.
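
The quoted setup can be sketched in PyTorch as below. This is a minimal illustration, not the authors' implementation: the exact form of the paper's contrastive loss (1) is not reproduced in this report, so an InfoNCE-style loss is assumed, and the names ProjectionHead, info_nce_loss, FEATURE_DIM, TAU and LR, as well as the choice of matching batch indices as positive pairs, are hypothetical.

# Minimal sketch of the stated setup (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 512   # feature dimension c = 512 (from the paper)
TAU = 0.1           # temperature τ in the contrastive loss (from the paper)
LR = 4e-4           # fixed Adam learning rate (from the paper)

class ProjectionHead(nn.Module):
    # 2-layer MLP h(·) mapping features to a 512-dimensional latent space.
    def __init__(self, in_dim=FEATURE_DIM, out_dim=FEATURE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def info_nce_loss(z1, z2, tau=TAU):
    # Assumed InfoNCE-style contrastive loss between two batches of
    # projected features; row i of z1 and row i of z2 form a positive pair.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

head = ProjectionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=LR)

Only the hyperparameters stated in the quote (c = 512, τ = 0.1, a 2-layer MLP projection head, Adam with a fixed learning rate of 4 × 10⁻⁴) are taken from the paper; everything else in the sketch is an assumption.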