Taylor Videos for Action Recognition
Authors: Lei Wang, Xiuyuan Yuan, Tom Gedeon, Liang Zheng
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvement is achieved. Additionally, we apply Taylor video computation to human skeleton sequences, resulting in Taylor skeleton sequences that outperform the use of original skeletons for skeleton-based action recognition. |
| Researcher Affiliation | Academia | (1) School of Computing, Australian National University, Canberra, Australia; (2) School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Perth, Australia. |
| Pseudocode | Yes | Algorithm 1 in Section B of the Appendix shows the efficient implementation of Taylor videos (see the illustrative sketch after this table). |
| Open Source Code | Yes | Code is available at: https://github.com/LeiWangR/video-ar. |
| Open Datasets | Yes | First, we use three small-scale datasets: HMDB-51 (Kuehne et al., 2011), MPII Cooking Activities (Rohrbach et al., 2012), and CATER (Girdhar & Ramanan, 2020). ... We then evaluate Taylor videos on large-scale Kinetics (K400 / K600) (Kay et al., 2017) and Something Something v2 (SSv2) (Mahdisoltani et al., 2018). We also evaluate the effectiveness of Taylor skeleton sequences using NTU-60 (Shahroudy et al., 2016), NTU-120 (Liu et al., 2019), and Kinetics-Skeleton (K-Skel) (Yan et al., 2018). |
| Dataset Splits | Yes | For HMDB-51, we use the standard 3 train/test splits and report the mean accuracy across the 3 splits. For MPII, we use the mean Average Precision (mAP) over 7-fold cross-validation. For CATER, we also report the mAP on both static and moving camera setups. ... Hyperparameters such as the number of epochs for training/fine-tuning are determined on the validation sets. |
| Hardware Specification | Yes | In this and following two experiments, we use 1 NVIDIA Tesla V100 GPU (with 12 CPUs). |
| Software Dependencies | No | No specific software dependencies with version numbers were mentioned. |
| Experiment Setup | Yes | Hyperparameters such as the number of epochs for training/fine-tuning are determined on the validation sets. We follow the default settings from the original papers, where we replicate their performance using RGB videos and/or optical flow as input. When implementing Taylor videos for transformer architectures, we add the grayscale frame to each of the displacement, velocity, and acceleration maps. We simply use the displacement concept with 1 term (4 frames per temporal block with a step size of 1) to compute the Taylor skeleton sequences (see the sketches after this table). |
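
The paper's efficient implementation (Algorithm 1) is referenced above but not reproduced here. The following is a minimal Python sketch of the general construction the responses describe: temporal derivatives within a temporal block are approximated by successive frame differences, and the displacement, velocity, and acceleration maps are formed as truncated Taylor series. The function name, the number of terms kept, and the unit time step are illustrative assumptions, not the authors' exact algorithm.

```python
import math
import numpy as np

def taylor_video_block(frames, num_terms=3):
    """Illustrative sketch (not the authors' Algorithm 1): build the
    displacement / velocity / acceleration maps for one temporal block of
    grayscale frames via truncated Taylor series, with temporal derivatives
    approximated by successive frame differences.

    frames:    (T, H, W) array, one temporal block of grayscale frames.
    num_terms: number of Taylor-series terms kept per map (assumed).
    returns:   (3, H, W) array stacking displacement, velocity, acceleration.
    """
    # diffs[k] holds the k-th order temporal difference of the block.
    diffs = [frames.astype(np.float64)]
    for _ in range(num_terms + 2):
        diffs.append(np.diff(diffs[-1], axis=0))

    def truncated_series(start_order):
        # Truncated Taylor series evaluated at the block's first frame with a
        # unit time step, starting from derivative order `start_order`.
        total = np.zeros_like(diffs[0][0])
        for i in range(num_terms):
            k = start_order + i
            if diffs[k].shape[0] == 0:  # block too short for this order
                break
            total += diffs[k][0] / math.factorial(i)
        return total

    displacement = truncated_series(1)  # 1st and higher-order derivatives
    velocity     = truncated_series(2)  # 2nd and higher-order derivatives
    acceleration = truncated_series(3)  # 3rd and higher-order derivatives
    return np.stack([displacement, velocity, acceleration])
```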
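
The experiment-setup row also quotes two implementation details without code: adding the grayscale frame to each Taylor map when feeding transformers, and computing Taylor skeleton sequences with a 1-term displacement over 4-frame temporal blocks with a step size of 1. The sketch below is one plausible reading of that description; the helper names and the choice to average the first-order differences within a block are assumptions.

```python
import numpy as np

def add_grayscale(taylor_maps, gray_frame):
    """Transformer-input variant quoted above: add the grayscale frame to
    each of the displacement / velocity / acceleration maps.
    taylor_maps: (3, H, W); gray_frame: (H, W)."""
    return taylor_maps + gray_frame[None, :, :]

def taylor_skeleton_sequence(skeletons, block_len=4, step=1):
    """Sketch of the Taylor skeleton computation as described: slide a
    4-frame temporal block with step size 1 and keep a 1-term displacement.
    Here that 1-term displacement is taken as the mean first-order temporal
    difference within the block (an assumption).

    skeletons: (T, J, C) array of T frames, J joints, C coordinates.
    returns:   (num_blocks, J, C) Taylor skeleton sequence.
    """
    out = []
    for start in range(0, skeletons.shape[0] - block_len + 1, step):
        block = skeletons[start:start + block_len].astype(np.float64)
        first_diff = np.diff(block, axis=0)   # (block_len - 1, J, C)
        out.append(first_diff.mean(axis=0))   # single-term displacement
    return np.stack(out)
```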