Multilevel Language and Vision Integration for Text-to-Clip Retrieval
Authors: Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko
AAAI 2019, pp. 9062-9069 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions. Our full model achieves state-of-the-art performance. We conduct extensive evaluation and ablation studies on two challenging benchmarks: Charades-STA (Gao et al. 2017) and ActivityNet Captions (Krishna et al. 2017). |
| Researcher Affiliation | Academia | Huijuan Xu,1 Kun He,1 Bryan A. Plummer,1 Leonid Sigal,2 Stan Sclaroff,1 Kate Saenko1 1Boston University, 2University of British Columbia hxu@bu.edu, hekun@bu.edu, bplum@bu.edu, lsigal@cs.ubc.ca, sclaroff@bu.edu, saenko@bu.edu |
| Pseudocode | No | The paper describes methods in text and uses figures, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is released for public use (https://github.com/VisionLearningGroup/Text-to-Clip_Retrieval). |
| Open Datasets | Yes | We evaluate our proposed models on one recent dataset designed for the text-to-clip retrieval task, Charades-STA (Gao et al. 2017), and one dataset designed for the dense video caption task which has the data annotations required by the text-to-clip retrieval task, the ActivityNet Captions dataset (Krishna et al. 2017). |
| Dataset Splits | Yes | The ActivityNet Captions dataset (Krishna et al. 2017) contains around 20k videos and is split into training, validation and testing with a 50%/25%/25% ratio. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models, memory, or specific cloud instance types used for experiments. |
| Software Dependencies | No | The paper mentions software components like 'word2vec', 'Adam optimizer', and 'C3D model', but does not provide specific version numbers for these or other libraries/frameworks. |
| Experiment Setup | Yes | We choose λ = 0.5 through cross-validation. The margin parameter η is set to 0.2 in the retrieval loss LRET. During training, each minibatch contains 32 matching sentence-clip pairs sampled from the training set, which are then used to construct triplets. We use the Adam optimizer (Kingma and Ba 2014) with learning rate 0.001 and early stopping on the validation set, for 30 epochs in total. The hidden state size of the LSTM is set to 512. The size of common embedding space in the late fusion retrieval model is 1024. |
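
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The snippet below is a minimal sketch, assuming a PyTorch implementation and a cosine-similarity triplet formulation of the retrieval loss L_RET; the function name `triplet_retrieval_loss`, the in-batch construction of non-matching pairs, and the model/optimizer wiring are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

# Hyperparameters quoted in the Experiment Setup row.
LAMBDA = 0.5         # loss-balancing weight, chosen via cross-validation (per the quoted setup)
MARGIN_ETA = 0.2     # margin in the retrieval loss L_RET
BATCH_PAIRS = 32     # matching sentence-clip pairs per minibatch
LSTM_HIDDEN = 512    # LSTM hidden state size
EMBED_DIM = 1024     # common embedding space in the late-fusion retrieval model
LEARNING_RATE = 1e-3 # Adam learning rate
EPOCHS = 30          # total epochs, with early stopping on the validation set


def triplet_retrieval_loss(clip_emb, sent_emb, margin=MARGIN_ETA):
    """Margin-based triplet loss over matching / non-matching sentence-clip pairs.

    clip_emb, sent_emb: (B, EMBED_DIM) tensors where row i of each forms a matching pair.
    Non-matching pairs are taken from the other rows of the minibatch; this in-batch
    negative sampling is an assumption made for illustration, since the paper only
    states that triplets are constructed from the sampled pairs.
    """
    clip_emb = F.normalize(clip_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    sim = clip_emb @ sent_emb.t()            # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarity of each matching pair
    hinge = F.relu(sim - pos + margin)       # violation for every candidate negative
    hinge = hinge - torch.diag(hinge.diag()) # zero out the matching-pair entries
    return hinge.mean()


# Hypothetical training wiring, shown only to situate the quoted optimizer settings:
# optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# total_loss = other_losses + LAMBDA * triplet_retrieval_loss(clip_emb, sent_emb)
```

In this sketch, LAMBDA is shown weighting L_RET against the model's other loss terms; the exact composition of the full objective should be taken from the paper itself.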