End-to-end Multi-modal Video Temporal Grounding
Authors: Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches. |
| Researcher Affiliation | Collaboration | ¹University of California, Merced; ²Phiar; ³Yonsei University; ⁴Google Research |
| Pseudocode | No | The paper describes the proposed framework using text and diagrams (Figure 1, Figure 2) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and models are available at https://github.com/wenz116/DRFT. |
| Open Datasets | Yes | We conduct extensive experiments on the Charades-STA [10] and ActivityNet Captions [17] datasets |
| Dataset Splits | Yes | ActivityNet Captions. ...The dataset is split into training, validation and testing sets with a ratio of 2:1:1, resulting in 37,421, 17,505 and 17,031 video-query pairs respectively. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions implementing the model in 'PyTorch' and using the 'Adam optimizer', but does not provide version numbers for PyTorch or any other software libraries/dependencies. |
| Experiment Setup | Yes | The feature dimension c is set to 512. In the contrastive loss (1), the temperature parameter τ is set to 0.1. The projection head h(·) is a 2-layer MLP that projects the feature to a 512-dimensional latent space. We implement the proposed model in PyTorch with the Adam optimizer and a fixed learning rate of 4×10⁻⁴. |
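
The quoted setup translates directly into a few lines of PyTorch. The sketch below is not the authors' released code (see the DRFT repository for that); it assumes an InfoNCE-style formulation for the contrastive loss (1), and the names `ProjectionHead` and `contrastive_loss` are hypothetical. Only the stated hyperparameters (c = 512, τ = 0.1, 2-layer MLP head, Adam with lr = 4×10⁻⁴) come from the paper.

```python
# Minimal sketch of the stated training setup (assumed, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 512    # feature dimension c (from the paper)
TAU = 0.1  # temperature tau in the contrastive loss (1) (from the paper)

class ProjectionHead(nn.Module):
    """2-layer MLP h(.) projecting features to a 512-d latent space."""
    def __init__(self, in_dim=C, out_dim=C):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, tau=TAU):
    """InfoNCE over a batch: matched pairs (z1[i], z2[i]) are positives,
    all other in-batch pairings serve as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                 # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

head = ProjectionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=4e-4)  # fixed learning rate
```

A small temperature such as τ = 0.1 sharpens the softmax over similarities, the common choice in SimCLR-style contrastive objectives; whether the paper's loss (1) matches this exact formulation would need to be checked against the released source.
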