Long Short-Term Transformer for Online Action Detection

Authors: Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, Stefano Soatto

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared to prior work, LSTR provides an effective and efficient method to model long videos with fewer heuristics, which is validated by extensive empirical analysis. LSTR achieves state-of-the-art performance on three standard online action detection benchmarks: THUMOS'14, TVSeries, and HACS Segment.
Researcher Affiliation | Industry | All authors (Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, Stefano Soatto) are affiliated with Amazon/AWS AI ({xumingze,yuanjx,hxen,xxnl,wxia,ztu,soattos}@amazon.com).
Pseudocode | No | The paper describes the architecture and processes of LSTR in prose and through diagrams (Figure 1, Figure 2), but does not include any formal pseudocode or algorithm blocks. A hedged sketch of the described two-memory design is given after this table.
Open Source Code | Yes | Code has been made available at: https://xumingze0308.github.io/projects/lstr.
Open Datasets | Yes | We evaluate our model on three publicly-available datasets: THUMOS'14 [30], TVSeries [14] and HACS Segment [76].
Dataset Splits | Yes | THUMOS'14 ... train on the validation set (200 untrimmed videos) and evaluate on the test set (213 untrimmed videos). ... HACS Segment ... It contains 35,300 untrimmed videos over 200 human action classes for training and 5,530 untrimmed videos for validation. (The quoted splits are restated as a small configuration after this table.)
Hardware Specification | Yes | We implemented our proposed model in PyTorch [1], and performed all experiments on a system with 8 Nvidia V100 graphics cards.
Software Dependencies | No | We implemented our proposed model in PyTorch [1]. The paper names PyTorch but does not give a version number, which is needed for a reproducible description of software dependencies.
Experiment Setup | Yes | For all Transformer units, we set their number of heads as 16 and hidden units as 1024 dimensions. To learn model weights, we used the Adam [34] optimizer with weight decay 5×10⁻⁵. The learning rate was linearly increased from zero to 5×10⁻⁵ in the first 2/5 of training iterations and then reduced to zero following a cosine function. Our models were optimized with batch size of 16, and the training was terminated after 25 epochs.
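
Since the paper provides no formal pseudocode, the following is a minimal PyTorch sketch of the two-memory design it describes in prose: a long-term memory compressed into a small set of latent vectors by transformer decoder units, which the short-term memory then queries to produce per-frame action scores. Only the head count (16) and hidden size (1024) follow the quoted setup; the number of latent tokens, layer counts, and number of classes are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class LSTRSketch(nn.Module):
        """Illustrative two-memory model: compress long-term memory, decode short-term memory."""

        def __init__(self, dim=1024, heads=16, num_latents=16, num_layers=2, num_classes=22):
            super().__init__()
            # Learnable latent tokens that summarize the long-term memory (count is an assumption).
            self.latents = nn.Parameter(torch.randn(num_latents, dim))
            # Long-term memory compression: latents cross-attend to the long memory.
            self.compressor = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True),
                num_layers=num_layers)
            # Short-term memory decoder: recent frames cross-attend to the compressed memory.
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True),
                num_layers=num_layers)
            self.classifier = nn.Linear(dim, num_classes)  # per-frame action scores

        def forward(self, long_mem, short_mem):
            # long_mem: (B, T_long, dim) frame features; short_mem: (B, T_short, dim).
            latents = self.latents.unsqueeze(0).expand(long_mem.size(0), -1, -1)
            compressed = self.compressor(tgt=latents, memory=long_mem)
            decoded = self.decoder(tgt=short_mem, memory=compressed)
            return self.classifier(decoded)  # (B, T_short, num_classes)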
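
For concreteness, the quoted dataset splits can be written down as a small configuration. The dictionary below only restates the numbers reported in the row above; the key names are arbitrary, and the TVSeries split is omitted because it is not quoted in this report.

    # Split configuration restating the quoted numbers; key names are hypothetical.
    DATASET_SPLITS = {
        "THUMOS14": {
            # Training uses the original validation set, evaluation the test set.
            "train": {"split": "validation", "num_videos": 200},
            "eval": {"split": "test", "num_videos": 213},
        },
        "HACS_Segment": {
            "train": {"num_videos": 35300, "num_classes": 200},
            "eval": {"split": "validation", "num_videos": 5530},
        },
    }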
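
The optimization recipe in the last row (Adam, weight decay 5×10⁻⁵, linear warmup to 5×10⁻⁵ over the first 2/5 of iterations, cosine decay to zero, batch size 16, 25 epochs) can be sketched as follows. The function name and the per-iteration stepping are assumptions, and the model is a placeholder.

    import math
    import torch

    def build_optimizer_and_scheduler(model, total_iters, base_lr=5e-5, weight_decay=5e-5):
        """Adam with linear warmup over the first 2/5 of iterations, then cosine decay to zero."""
        warmup_iters = int(0.4 * total_iters)  # first 2/5 of all training iterations
        optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=weight_decay)

        def lr_lambda(it):
            if it < warmup_iters:
                return it / max(1, warmup_iters)  # linear warmup from zero to base_lr
            progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
            return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
        return optimizer, scheduler

    # Usage sketch: with batch size 16 and 25 epochs, total_iters = 25 * len(train_loader),
    # and scheduler.step() is called once per training iteration.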