Multilevel Language and Vision Integration for Text-to-Clip Retrieval

Authors: Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko

AAAI 2019, pp. 9062-9069

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions. Our full model achieves state-of-the-art performance. We conduct extensive evaluation and ablation studies on two challenging benchmarks: Charades-STA (Gao et al. 2017) and ActivityNet Captions (Krishna et al. 2017).
Researcher Affiliation | Academia | Huijuan Xu (1), Kun He (1), Bryan A. Plummer (1), Leonid Sigal (2), Stan Sclaroff (1), Kate Saenko (1); (1) Boston University, (2) University of British Columbia; hxu@bu.edu, hekun@bu.edu, bplum@bu.edu, lsigal@cs.ubc.ca, sclaroff@bu.edu, saenko@bu.edu
Pseudocode | No | The paper describes methods in text and uses figures, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is released for public use at https://github.com/VisionLearningGroup/Text-to-Clip_Retrieval.
Open Datasets | Yes | We evaluate our proposed models on one recent dataset designed for the text-to-clip retrieval task, Charades-STA (Gao et al. 2017), and one dataset designed for the dense video caption task which has the data annotations required by the text-to-clip retrieval task, the ActivityNet Captions dataset (Krishna et al. 2017).
Dataset Splits | Yes | The ActivityNet Captions dataset (Krishna et al. 2017) contains around 20k videos and is split into training, validation and testing with a 50%/25%/25% ratio.
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models, memory, or specific cloud instance types used for experiments.
Software Dependencies | No | The paper mentions software components like 'word2vec', 'Adam optimizer', and 'C3D model', but does not provide specific version numbers for these or other libraries/frameworks.
Experiment Setup | Yes | We choose λ = 0.5 through cross-validation. The margin parameter η is set to 0.2 in the retrieval loss L_RET. During training, each minibatch contains 32 matching sentence-clip pairs sampled from the training set, which are then used to construct triplets. We use the Adam optimizer (Kingma and Ba 2014) with learning rate 0.001 and early stopping on the validation set, for 30 epochs in total. The hidden state size of the LSTM is set to 512. The size of the common embedding space in the late fusion retrieval model is 1024.
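
For concreteness, below is a minimal PyTorch-style sketch of how the reported retrieval hyperparameters (margin η = 0.2, weight λ = 0.5, minibatches of 32 matching sentence-clip pairs, Adam with learning rate 0.001, a 1024-d common embedding space) could fit together. This is not the authors' released implementation; the bidirectional batch-triplet formulation, the loss combination, and all names such as `retrieval_loss`, `clip_emb`, and `sent_emb` are illustrative assumptions, with only the numerical values taken from the paper.

```python
# Minimal sketch (not the authors' released code) of a triplet-margin
# retrieval loss and training configuration matching the reported setup.
import torch
import torch.nn.functional as F

ETA = 0.2         # margin η in the retrieval loss L_RET (from the paper)
LAMBDA = 0.5      # weight λ balancing the loss terms (chosen via cross-validation)
BATCH = 32        # matching sentence-clip pairs per minibatch
EMBED_DIM = 1024  # size of the common embedding space (late fusion model)

def retrieval_loss(clip_emb, sent_emb, eta=ETA):
    """Bidirectional triplet hinge loss over a batch of matching pairs.

    clip_emb, sent_emb: (B, EMBED_DIM) tensors; row i of each forms a
    matching pair, and the other rows in the batch act as negatives.
    """
    clip_emb = F.normalize(clip_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    sim = clip_emb @ sent_emb.t()              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)              # similarity of matching pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    # Push each clip's non-matching sentences (and each sentence's
    # non-matching clips) at least `eta` below its matching pair.
    cost_c2s = (eta + sim - pos).clamp(min=0).masked_fill(mask, 0.0)
    cost_s2c = (eta + sim - pos.t()).clamp(min=0).masked_fill(mask, 0.0)
    return (cost_c2s.sum() + cost_s2c.sum()) / sim.size(0)

# Training configuration mirroring the reported setup: Adam with learning
# rate 0.001 for 30 epochs, with early stopping on the validation set.
# `model` is a hypothetical module producing the two 1024-d embeddings,
# with a 512-d hidden-state LSTM encoding the sentence.
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# total_loss = retrieval_loss(clip_emb, sent_emb) + LAMBDA * other_loss
```

The commented `total_loss` line only illustrates where a λ-weighted combination of loss terms would enter; the paper's actual multi-task objective should be taken from the paper itself rather than from this sketch.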