Exploiting Visual Semantic Reasoning for Video-Text Retrieval
Authors: Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two public benchmark datasets validate the effectiveness of our method by achieving state-of-the-art performance due to the powerful semantic reasoning. |
| Researcher Affiliation | Academia | 1Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China 2Beijing Laboratory of Advanced Information Networks, Beijing, China |
| Pseudocode | No | The paper describes its method using mathematical equations and textual explanations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We evaluate our model on two benchmark datasets: the MSRVTT dataset [Xu et al., 2016] and the MSVD dataset [Chen and Dolan, 2011]. |
| Dataset Splits | Yes | We follow the identical partition strategy in [Dong et al., 2019] for training, testing and validation in our experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'ResNet-101', and 'Faster R-CNN', but it does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We uniformly sample 16 frames from videos with the same time interval between every two frames. The number n of regions within a frame is 36, identical to [Anderson et al., 2018]. The dimension d of region features extracted from ResNet-101 is 2048. ... We set the word embedding size to 500 and the dimension of the common space D to 2048, similar to [Dong et al., 2019]. The margin parameter α is empirically chosen to be 0.2. The size of a mini-batch is 64. The optimizer in the training procedure is Adam with 50 epochs at most. We start training with an initial learning rate 0.0001, and the adjustment schedule is that once the validation loss does not decrease in three consecutive epochs, the learning rate is divided by 2. |
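The learning-rate schedule quoted in the Experiment Setup row (start at 0.0001, halve when validation loss fails to decrease for three consecutive epochs) can be sketched in plain Python. This is a minimal illustration of the described rule, not the authors' code; the class name `HalveOnPlateau` and its interface are our own assumptions.

```python
class HalveOnPlateau:
    """Sketch of the paper's stated LR schedule (not the authors' code):
    halve the learning rate once validation loss fails to decrease
    for `patience` consecutive epochs. Paper values: lr=1e-4, patience=3."""

    def __init__(self, lr=1e-4, patience=3):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # consecutive epochs without improvement

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the LR to use."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= 2
                self.bad_epochs = 0
        return self.lr


# Example: two improving epochs, then three stagnant ones trigger a halving.
sched = HalveOnPlateau()
for loss in [1.0, 0.9, 0.95, 0.95, 0.95]:
    lr = sched.step(loss)
print(lr)  # 5e-05
```

In PyTorch this behavior roughly corresponds to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)` wrapped around Adam, though the paper does not name a framework.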