Taylor Videos for Action Recognition

Authors: Lei Wang, Xiuyuan Yuan, Tom Gedeon, Liang Zheng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvement is achieved. Additionally, we apply Taylor video computation to human skeleton sequences, resulting in Taylor skeleton sequences that outperform the use of original skeletons for skeleton-based action recognition.
Researcher Affiliation | Academia | School of Computing, Australian National University, Canberra, Australia; School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Perth, Australia.
Pseudocode | Yes | Algorithm 1 in Section B of the Appendix shows the efficient implementation of Taylor videos. (A hedged illustrative sketch follows the table below.)
Open Source Code | Yes | Code is available at: https://github.com/LeiWangR/video-ar.
Open Datasets | Yes | First, we use three small-scale datasets: HMDB-51 (Kuehne et al., 2011), MPII Cooking Activities (Rohrbach et al., 2012), and CATER (Girdhar & Ramanan, 2020). ... We then evaluate Taylor videos on large-scale Kinetics (K400/K600) (Kay et al., 2017) and Something-Something v2 (SSv2) (Mahdisoltani et al., 2018). We also evaluate the effectiveness of Taylor skeleton sequences using NTU-60 (Shahroudy et al., 2016), NTU-120 (Liu et al., 2019), and Kinetics-Skeleton (K-Skel) (Yan et al., 2018).
Dataset Splits | Yes | For HMDB-51, we use the standard 3 train/test splits and report the mean accuracy across the 3 splits. For MPII, we use the mean Average Precision (mAP) over 7-fold cross-validation. For CATER, we also report the mAP on both static and moving camera setups. ... Hyperparameters such as the number of epochs for training/fine-tuning are determined on the validation sets.
Hardware Specification | Yes | In this and the following two experiments, we use 1 NVIDIA Tesla V100 GPU (with 12 CPUs).
Software Dependencies | No | No specific software dependencies with version numbers were mentioned.
Experiment Setup | Yes | Hyperparameters such as the number of epochs for training/fine-tuning are determined on the validation sets. We follow the default settings from the original papers, where we replicate their performance using RGB videos and/or optical flow as input. When implementing Taylor videos for transformer architectures, we add the grayscale frame to each of the displacement, velocity, and acceleration maps. We simply use the displacement concept with 1 term (4 frames per temporal block with a step size of 1) to compute the Taylor skeleton sequences.
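
The pseudocode and experiment-setup rows above describe the efficient Taylor-video computation (Algorithm 1) and the block settings used in the experiments: grayscale temporal blocks of 4 frames with a step size of 1, a single displacement term, and the grayscale frame added to each map for transformer inputs. Below is a minimal NumPy sketch of that idea. It assumes the displacement, velocity, and acceleration maps are truncated Taylor-style sums of increasingly high-order temporal differences of grayscale frames; the factorial weighting, the within-block averaging, and the `taylor_frame` / `taylor_video` names and `num_terms` / `block_len` / `step` parameters are illustrative assumptions, not a transcription of the paper's Algorithm 1.

```python
import math
import numpy as np


def taylor_frame(block, num_terms=1, add_gray=True):
    """Turn one temporal block of grayscale frames into a 3-channel Taylor frame.

    block     : (T, H, W) array of T consecutive grayscale frames
                (e.g. T=4 with a step size of 1, as in the experiment setup)
    num_terms : number of Taylor terms kept per channel (assumed knob;
                the reported setup uses a single displacement term)
    add_gray  : add the block's first grayscale frame to each map, mirroring
                the transformer-input variant quoted above
    Returns a (3, H, W) array: displacement, velocity, acceleration maps.
    NOTE: the weighting/averaging below is an assumption for illustration.
    """
    block = np.asarray(block, dtype=np.float64)
    H, W = block.shape[1:]

    # Higher-order forward temporal differences: diffs[i] has shape (T - i, H, W).
    diffs = [block]
    for _ in range(num_terms + 2):          # enough orders for the acceleration map
        diffs.append(np.diff(diffs[-1], axis=0))

    def truncated_series(start_order):
        """Sum num_terms difference terms from start_order, averaged over the
        block and weighted by 1/order! (truncated Taylor-style sum)."""
        out = np.zeros((H, W))
        for i in range(num_terms):
            order = start_order + i
            if diffs[order].shape[0] == 0:  # block too short for this order
                break
            out += diffs[order].mean(axis=0) / math.factorial(order)
        return out

    displacement = truncated_series(1)
    velocity = truncated_series(2)
    acceleration = truncated_series(3)

    maps = np.stack([displacement, velocity, acceleration], axis=0)
    if add_gray:
        maps = maps + block[0]              # broadcast the grayscale frame onto each map
    return maps


def taylor_video(gray_frames, block_len=4, step=1, **kwargs):
    """Slide a temporal block over a grayscale video and stack Taylor frames."""
    frames = []
    for start in range(0, len(gray_frames) - block_len + 1, step):
        frames.append(taylor_frame(gray_frames[start:start + block_len], **kwargs))
    return np.stack(frames, axis=0)         # (N, 3, H, W), an RGB-like input tensor
```

With `block_len=4`, `step=1`, and `num_terms=1`, the resulting tensor has the same layout as an RGB clip, so it can be fed to 2D CNN, 3D CNN, or transformer backbones in place of (or fused with) RGB frames, in the spirit of the experiment setup quoted above.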