End-to-end Multi-modal Video Temporal Grounding

Authors: Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
Researcher Affiliation | Collaboration | 1 University of California, Merced; 2 Phiar; 3 Yonsei University; 4 Google Research
Pseudocode | No | The paper describes the proposed framework using text and diagrams (Figures 1 and 2) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and models are available at https://github.com/wenz116/DRFT.
Open Datasets | Yes | We conduct extensive experiments on the Charades-STA [10] and ActivityNet Captions [17] datasets
Dataset Splits | Yes | ActivityNet Captions. ... The dataset is split into training, validation and testing sets with a ratio of 2:1:1, resulting in 37,421, 17,505 and 17,031 video-query pairs, respectively.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions implementing the model in PyTorch and training with the Adam optimizer, but does not provide version numbers for PyTorch or any other software libraries/dependencies.
Experiment Setup | Yes | The feature dimension c is set to 512. In the contrastive loss (1), the temperature parameter τ is set to 0.1. The projection head h(·) is a 2-layer MLP that projects the feature into a 512-dimensional latent space. We implement the proposed model in PyTorch with the Adam optimizer and a fixed learning rate of 4 × 10⁻⁴.
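
The quoted setup can be sketched in PyTorch as below. This is a minimal illustration, not the authors' implementation: the exact form of the paper's contrastive loss (1) is not reproduced in this report, so an InfoNCE-style loss is assumed, and the names ProjectionHead, info_nce_loss, FEATURE_DIM, TAU and LR, as well as the choice of matching batch indices as positive pairs, are hypothetical.

# Minimal sketch of the stated setup (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 512   # feature dimension c = 512 (from the paper)
TAU = 0.1           # temperature τ in the contrastive loss (from the paper)
LR = 4e-4           # fixed Adam learning rate (from the paper)

class ProjectionHead(nn.Module):
    # 2-layer MLP h(·) mapping features to a 512-dimensional latent space.
    def __init__(self, in_dim=FEATURE_DIM, out_dim=FEATURE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def info_nce_loss(z1, z2, tau=TAU):
    # Assumed InfoNCE-style contrastive loss between two batches of
    # projected features; row i of z1 and row i of z2 form a positive pair.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

head = ProjectionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=LR)

Only the hyperparameters stated in the quote (c = 512, τ = 0.1, a 2-layer MLP projection head, Adam with a fixed learning rate of 4 × 10⁻⁴) are taken from the paper; everything else in the sketch is an assumption.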