Weakly-Supervised Video Re-Localization with Multiscale Attention Model

Authors: Yung-Han Huang, Kuang-Jui Hsu, Shyh-Kang Jeng, Yen-Yu Lin (pp. 11077-11084)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our method is evaluated on a public dataset and achieves the state-of-the-art performance under both weakly supervised and fully supervised settings." "Experimental Results: In this section, we evaluate our method on the benchmark dataset."
Researcher Affiliation | Collaboration | Yung-Han Huang (1,2), Kuang-Jui Hsu (1,2,3), Shyh-Kang Jeng (2), Yen-Yu Lin (1,4); 1 Academia Sinica, 2 National Taiwan University, 3 Qualcomm, 4 National Chiao Tung University
Pseudocode | No | The paper describes the method using mathematical equations and textual descriptions but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | "The dataset used for video re-localization is collected in (Feng et al. 2018) from ActivityNet (Heilbron et al. 2015), which is a large-scale action localization dataset with segment-level action annotations."
Dataset Splits | Yes | "The dataset is split into three disjoint sets including the training, validation and testing sets. There are 7,593 videos of 160 classes in the training set, 978 videos of 20 classes in the validation set and 829 videos of 20 classes in the testing set."
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a pre-trained C3D network, the Adam solver, and PCA, but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | "The model is optimized by an Adam solver with a batch size of 128. The initial learning rate is set as 0.001 and decreased by a factor of 10 every 200 iterations. In the self-attention layer, we use two heads, i.e., h = 2, and set d_k as 32. In the multiscale attention layer, we use 40 heads, i.e., k = 40. The details of the convolutional layer parameters of each head are summarized in Table 1. We adopt dropout before and after the Bi-LSTM in the localization predictor, and the dropout rates are set as 0.4 and 0.2, respectively."
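
The Experiment Setup row above specifies the optimizer, learning-rate schedule, attention-head configuration, and dropout placement. The following is a minimal PyTorch sketch of that configuration only; the class and variable names (LocalizationPredictor, clip_feats), the feature and hidden dimensions, and the interpretation of the learning-rate decay are assumptions of this sketch, since the authors do not release code, and the 40-head multiscale attention layer (whose per-head convolutional parameters are given in the paper's Table 1) is omitted.

```python
# Hypothetical sketch of the reported training configuration.
# Names and dimensions are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class LocalizationPredictor(nn.Module):
    """Bi-LSTM predictor with dropout 0.4 before and 0.2 after the Bi-LSTM."""

    def __init__(self, feat_dim=64, hidden_dim=128, num_labels=4):
        super().__init__()
        self.drop_in = nn.Dropout(p=0.4)   # dropout before the Bi-LSTM
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.drop_out = nn.Dropout(p=0.2)  # dropout after the Bi-LSTM
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        x = self.drop_in(x)
        x, _ = self.bilstm(x)
        x = self.drop_out(x)
        return self.classifier(x)          # per-clip localization scores


# Self-attention layer with h = 2 heads and per-head dimension d_k = 32,
# so the embedding dimension is h * d_k = 64 (an assumption of this sketch).
self_attention = nn.MultiheadAttention(embed_dim=2 * 32, num_heads=2,
                                       batch_first=True)
predictor = LocalizationPredictor()

# Adam solver, batch size 128, initial learning rate 0.001, decreased by a
# factor of 10 every 200 iterations (one reading of the paper's schedule).
params = list(self_attention.parameters()) + list(predictor.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)

# Dummy forward pass over C3D-like clip features
# (a batch of 128 videos, 20 clips each, 64-dimensional features).
clip_feats = torch.randn(128, 20, 64)
attended, _ = self_attention(clip_feats, clip_feats, clip_feats)
scores = predictor(attended)
print(scores.shape)   # torch.Size([128, 20, 4])
```

This sketch only mirrors the hyperparameters quoted above; reproducing the paper would additionally require the pre-trained C3D feature extractor, the PCA step, and the multiscale attention heads parameterized in the paper's Table 1, none of which are specified in enough detail here to implement without further assumptions.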