SMART Frame Selection for Action Recognition

Authors: Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our SMART frame selection network on several trimmed action recognition datasets, including Something-Something, UCF101 and subsets of Kinetics. We observe that in all of them the proposed method outperforms the baselines, including using the full video, while reducing the computational cost by a factor of 4 to 10, depending on the dataset. We also test the proposed method in the untrimmed setting on ActivityNet and FCVID, where we get higher accuracies than all previous work on frame selection. Further, we extend our frame selection approach to select frames that are then passed at test time to deep action recognition models, and show that we obtain state-of-the-art results on UCF101 and HMDB51, which are trimmed video datasets, showing that frame selection can be an important step to improve accuracy in trimmed action recognition.
Researcher Affiliation | Collaboration | Shreyank N Gowda (1), Marcus Rohrbach (2), Laura Sevilla-Lara (1); 1: University of Edinburgh, 2: Facebook AI Research
Pseudocode | No | The paper includes mathematical equations and a diagram (Figure 2) illustrating the computational steps, but it does not present a formal pseudocode block or a labeled algorithm. (A hedged sketch of the score-and-select pattern described in the abstract appears after this table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor a direct link to a code repository.
Open Datasets | Yes | We use 6 of the most popular benchmark datasets throughout our experimental analysis. We use the Something-Something-v2 dataset (Goyal et al. 2017) for our extensive ablation study. ... The Kinetics (Carreira and Zisserman 2017) dataset is one of the most widely used large-scale datasets in action recognition. ... we also use the well-known UCF101 (Soomro, Zamir, and Shah 2012) dataset, which contains 101 classes and about 13K videos. We also extend our approach as a pre-processing step for more complex models and compare performances on HMDB51, which contains 51 classes and 6,849 video clips, along with UCF101. ... ActivityNet (Caba Heilbron et al. 2015) and FCVID (Jiang et al. 2017b).
Dataset Splits | Yes | The Something-Something dataset has a total of 168,913 training videos and 24,777 validation videos with a total of 174 classes. ... As the testing labels are not available publicly, the reported performances are on the validation set. (A sketch for checking these counts against the official label files follows the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, memory, or cloud computing specifications.
Software Dependencies | No | The paper mentions 'PyTorch for implementation' and the use of 'MobileNet' and 'GloVe', along with specific backbone architectures such as 'ResNet-152' and 'Inception-v3', but it does not specify version numbers for any of these software dependencies. (A version-logging snippet follows the table.)
Experiment Setup | Yes | All frames are resized to 224x224. We use mini-batch stochastic gradient descent with a momentum of 0.9. We run 200 epochs on UCF101 and the Kinetics subsets, and 100 epochs on the Something-Something dataset and ActivityNet due to the computational requirements of these larger-scale datasets. We use a batch size of 128 for UCF101 and the Kinetics subsets, and a batch size of 64 for the ActivityNet and Something-Something datasets. The initial learning rate is set to 0.0001 and is divided by 10 every 25 epochs. (A training-configuration sketch follows the table.)
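
Since the paper offers no formal pseudocode, the following is a minimal PyTorch sketch of the general pattern the abstract describes: a lightweight network scores every candidate frame, the top-k frames are kept, and only those are passed to the expensive recognition backbone. The scorer architecture, feature dimension, and function names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Cheap per-frame importance scorer (placeholder for e.g. MobileNet features + MLP)."""
    def __init__(self, feat_dim=1280):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, frame_feats):                  # (num_frames, feat_dim)
        return self.head(frame_feats).squeeze(-1)    # (num_frames,) importance scores

def select_frames(frame_feats, scorer, k=10):
    """Keep the k highest-scoring frames, restored to temporal order."""
    scores = scorer(frame_feats)                     # one score per frame
    topk_idx = torch.topk(scores, k).indices         # indices of the k best frames
    return frame_feats[topk_idx.sort().values]       # re-sort indices to keep temporal order

# Usage: features for 64 candidate frames -> 10 selected frames for the classifier.
feats = torch.randn(64, 1280)
selected = select_frames(feats, FrameScorer(), k=10)
print(selected.shape)  # torch.Size([10, 1280])
```

Passing only the selected frames to the heavy backbone is what yields the 4x to 10x compute reduction the abstract reports.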
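
The reported split sizes can be checked directly against the Something-Something-v2 label files. The file names below assume the standard layout of the official release; adjust the paths to a local copy.

```python
import json

# Count videos in the Something-Something-v2 label files
# (file names are an assumption about the standard release layout).
with open("something-something-v2-train.json") as f:
    train = json.load(f)
with open("something-something-v2-validation.json") as f:
    val = json.load(f)
with open("something-something-v2-labels.json") as f:
    labels = json.load(f)

# The paper reports 168,913 training videos, 24,777 validation videos, 174 classes.
print(len(train), len(val), len(labels))
```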
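
Because the report flags missing version numbers as a reproducibility gap, a reproduction should record the versions of the dependencies the paper names. A minimal snippet, assuming torch and torchvision are installed:

```python
import sys
import torch
import torchvision

# Log the versions a reproduction would need to pin; the paper names PyTorch
# (plus MobileNet / ResNet-152 / Inception-v3 backbones) but gives no versions.
print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA build :", torch.version.cuda)
```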
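
The Experiment Setup row maps directly onto a standard PyTorch training configuration. This is a minimal sketch under the stated hyperparameters; the model is a placeholder and the training loop body is elided, so it is an illustration of the reported settings, not the authors' code.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# "All frames are resized to 224x224."
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

model = nn.Linear(3 * 224 * 224, 101)        # placeholder for the actual network
optimizer = optim.SGD(model.parameters(),
                      lr=1e-4,               # initial learning rate 0.0001
                      momentum=0.9)          # SGD momentum 0.9
scheduler = StepLR(optimizer, step_size=25, gamma=0.1)  # lr / 10 every 25 epochs

num_epochs = 200   # 200 for UCF101 / Kinetics subsets; 100 for SSv2 / ActivityNet
batch_size = 128   # 128 for UCF101 / Kinetics subsets; 64 for SSv2 / ActivityNet

for epoch in range(num_epochs):
    # ... one pass over the training loader goes here ...
    scheduler.step()
```

StepLR with step_size=25 and gamma=0.1 is the standard way to express "divided by 10 every 25 epochs" in PyTorch.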