Semantic Proposal for Activity Localization in Videos via Sentence Query
Authors: Shaoxiang Chen, Yu-Gang Jiang (pp. 8199-8206)
AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our algorithm on the TACoS dataset and the Charades-STA dataset. Experimental results show that our algorithm outperforms existing methods on both datasets, and at the same time reduces the number of proposals by a factor of at least 10. |
| Researcher Affiliation | Academia | Shaoxiang Chen, Yu-Gang Jiang Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University Shanghai Institute of Intelligent Electronics & Systems {sxchen13, ygj}@fudan.edu.cn |
| Pseudocode | Yes | Algorithm 1 Semantic Activity Proposal Generation |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | TACoS (Regneri et al. 2013). The TACoS dataset is built on the MPII Cooking Composite Activities (Rohrbach et al. 2012b; 2012a), which contains fine-grained temporal annotations of cooking activities. There are 127 videos in the dataset. Following previous work, we split the dataset into training, validation and test sets with 75, 27 and 25 videos, respectively. Each annotation contains one sentence and the start and end time of the activity it describes in the video. The numbers of annotations in training, validation and test sets are 10146, 4589 and 4083, respectively. The average length of the sentences is 6.2 words, the average duration of the videos is 287.1 seconds, and the average number of activities per video is 21.4. Charades-STA (Gao et al. 2017a). The Charades-STA dataset is built on the Charades (Sigurdsson et al. 2016) dataset, which contains 9848 videos of daily indoor activities collected through Amazon Mechanical Turk. There are 16128 clip-sentence pairs in the released Charades-STA dataset, which are split into training and test sets of 12408 and 3720 clip-sentence pairs, respectively. The average length of the sentences is 8.6 words, the average duration of the videos is 29.8 seconds, and the average number of activities per video is 2.3. (See the configuration sketch after the table.) |
| Dataset Splits | Yes | Following previous work, we split the dataset into training, validation and test sets with 75, 27 and 25 videos, respectively. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like "VGG16 network", "Momentum algorithm", "Adam algorithm", "Skip-thought", and "GloVe" but does not specify their version numbers. |
| Experiment Setup | Yes | We use the Momentum algorithm with a learning rate of 10^-5 and batch size of 16 to train the visual concept detector. In the proposal evaluation module, the visual feature is extracted from the visual concept detector’s fc6 layer. The number of clusters for VLAD is 64 and the number of units for LSTM is 1024. The hyper-parameters in the losses, α and β, are 0.015 and 0.01, respectively. The final loss is optimized by the Adam algorithm with a learning rate of 10^-4 and batch size of 64. During training, the proposals are generated by the dense sliding window method. For each annotation, we generate sliding windows of length [64, 128, 256, 512] frames for the video to cover the annotated temporal region. Only windows having temporal IoU ≥ 0.5 are used for training. For evaluation, the generated proposal lengths are in [128, 256] (decided based on the statistics of the datasets). (See the sliding-window sketch after the table.) |
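
For convenience, the dataset splits and optimization settings quoted in the "Open Datasets", "Dataset Splits", and "Experiment Setup" rows can be collected into a single configuration sketch. The numbers below are taken directly from the table; the dictionary structure and key names are assumptions made for illustration, not the authors' released code (none is available).

```python
# Hypothetical configuration collecting the figures quoted in the table above.
# Values come from the paper excerpts; structure and key names are assumptions.
CONFIG = {
    "datasets": {
        "TACoS": {
            "videos": 127,
            "video_split": {"train": 75, "val": 27, "test": 25},
            "annotations": {"train": 10146, "val": 4589, "test": 4083},
            "avg_sentence_len_words": 6.2,
            "avg_video_duration_s": 287.1,
        },
        "Charades-STA": {
            "clip_sentence_pairs": {"train": 12408, "test": 3720},
            "avg_sentence_len_words": 8.6,
            "avg_video_duration_s": 29.8,
        },
    },
    "visual_concept_detector": {
        "backbone": "VGG16",
        "optimizer": "Momentum",
        "learning_rate": 1e-5,
        "batch_size": 16,
    },
    "proposal_evaluation": {
        "vlad_clusters": 64,
        "lstm_units": 1024,
        "loss_weights": {"alpha": 0.015, "beta": 0.01},
        "optimizer": "Adam",
        "learning_rate": 1e-4,
        "batch_size": 64,
    },
}
```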
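
The training-time proposal generation described in the "Experiment Setup" row (dense sliding windows of 64, 128, 256, or 512 frames, kept only when their temporal IoU with the annotated segment is at least 0.5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stride (half the window length) and all function names are ours, and temporal IoU is computed as the usual intersection-over-union of frame intervals.

```python
# Minimal sketch of training-time sliding-window proposal generation with
# temporal-IoU filtering, as described in the Experiment Setup row.
# Window lengths and the IoU threshold are from the paper; the stride and
# function names are assumptions for illustration.

WINDOW_LENGTHS_TRAIN = [64, 128, 256, 512]  # frames (training)
WINDOW_LENGTHS_EVAL = [128, 256]            # frames (evaluation)
IOU_THRESHOLD = 0.5


def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two (start, end) frame segments."""
    inter = max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def sliding_window_proposals(num_frames, window_lengths, stride_ratio=0.5):
    """Densely sample (start, end) windows over a video of num_frames frames.

    The stride (half the window length) is an assumption; the paper only says
    the windows are generated densely.
    """
    proposals = []
    for length in window_lengths:
        if length > num_frames:
            continue
        stride = max(1, int(length * stride_ratio))
        for start in range(0, num_frames - length + 1, stride):
            proposals.append((start, start + length))
    return proposals


def training_proposals(num_frames, annotation):
    """Keep only windows overlapping the annotated segment with IoU >= 0.5."""
    windows = sliding_window_proposals(num_frames, WINDOW_LENGTHS_TRAIN)
    return [w for w in windows if temporal_iou(w, annotation) >= IOU_THRESHOLD]


if __name__ == "__main__":
    # Toy example: a 1000-frame video with an activity annotated at frames 300-450.
    kept = training_proposals(1000, (300, 450))
    print(f"{len(kept)} training proposals retained")
```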