TAda! Temporally-Adaptive Convolutions for Video Understanding

Authors: Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, Marcelo H. Ang Jr

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We construct TAda2D and TAdaConvNeXt networks by replacing the 2D convolutions in ResNet and ConvNeXt with TAdaConv, which leads to at least on-par or better performance compared to state-of-the-art approaches on multiple video action recognition and localization benchmarks. We also demonstrate that as a readily plug-in operation with negligible computation overhead, TAdaConv can effectively improve many existing video models with a convincing margin.
Researcher Affiliation | Collaboration | (1) Advanced Robotics Centre, National University of Singapore; (2) DAMO Academy, Alibaba Group; (3) S-Lab, Nanyang Technological University
Pseudocode | No | The paper describes procedures using text and mathematical formulations but does not include structured pseudocode or algorithm blocks. A hedged sketch of the core operation is given below.
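Since the operation is given only in prose and equations, the following is a minimal PyTorch sketch of the idea: each frame's kernel is the shared base kernel scaled by a per-frame calibration factor generated from temporally aggregated frame descriptors. The class name, the hidden-width reduction, and the calibration-branch layout (including applying the factor along the input-channel axis) are illustrative assumptions, not the authors' implementation; the official code on the project page is authoritative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConv2dSketch(nn.Module):
    """Sketch of a temporally-adaptive 2D convolution: for frame t the
    effective kernel is W_t = alpha_t * W_base, with alpha_t produced
    from per-frame descriptors with temporal context."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.padding = padding
        self.base_weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size))
        nn.init.kaiming_normal_(self.base_weight)
        self.bias = nn.Parameter(torch.zeros(out_channels))
        # Calibration branch (layout assumed): temporal 1D convolutions over
        # per-frame descriptors give one factor per input channel per frame.
        hidden = max(in_channels // 4, 4)
        self.calibrate = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, in_channels, kernel_size=3, padding=1),
        )
        # Zero-init the last layer so alpha starts at 1 and the module
        # behaves like a plain shared-weight convolution at initialization.
        nn.init.zeros_(self.calibrate[-1].weight)
        nn.init.zeros_(self.calibrate[-1].bias)

    def forward(self, x):
        # x: (N, C_in, T, H, W)
        n, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))           # frame descriptors, (N, C_in, T)
        alpha = 1.0 + self.calibrate(desc)  # per-frame factors, (N, C_in, T)
        # Scaling the input channels per frame is equivalent to scaling the
        # kernel along its C_in axis, since alpha is constant over (H, W).
        x = x * alpha[:, :, :, None, None]
        x = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        y = F.conv2d(x, self.base_weight, self.bias, padding=self.padding)
        return y.reshape(n, t, -1, h, w).permute(0, 2, 1, 3, 4)

At this sketch level, swapping the spatial 3x3 convolutions in ResNet's residual blocks for such a module reflects how TAda2D is obtained from ResNet; the released implementation additionally handles strides, groups, and the exact calibration dimensions.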
Open Source Code | Yes | Project page: https://tadaconv-iclr2022.github.io/
Open Datasets | Yes | For video classification, we use Kinetics-400 (Kay et al., 2017), Something-Something V2 (Goyal et al., 2017), and Epic-Kitchens-100 (Damen et al., 2020). For action localization, we use HACS (Zhao et al., 2019) and Epic-Kitchens-100 (Damen et al., 2020).
Dataset Splits | No | The paper describes training and evaluation procedures, including frame sampling and data augmentation (Appendix C.1), and refers to 'validation' in its figures (Appendix I). However, it neither describes the dataset splits (e.g., percentages or sample counts for training, validation, and test sets) in the main text or appendix, nor explicitly states that the standard splits of the cited datasets are used.
Hardware Specification | No | Our experiments on the action classification are conducted on three large-scale datasets. For all action classification models, we train them with synchronized SGD using 16 GPUs. (The number of GPUs is given, but the GPU model and memory are not specified.)
Software Dependencies | No | The paper mentions optimization algorithms (SGD, AdamW) and model architectures (ResNet, ConvNeXt), but it does not specify version numbers for any software components such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Our experiments on the action classification are conducted on three large-scale datasets. For all action classification models, we train them with synchronized SGD using 16 GPUs. The batch size for each GPU is 16 and 8 respectively for 8-frame and 16-frame models... For all models, we use a dropout ratio (Hinton et al., 2012) of 0.5 before the classification heads. Spatially, we randomly resize the short side of the video to [256, 320] and crop a region of 224 × 224... On Kinetics-400, a half-period cosine schedule is applied for decaying the learning rate following Feichtenhofer et al. (2019), with the base learning rate set to 0.24 for ResNet-based models using SGD... The models are trained for 100 epochs. In the first 8 epochs, we adopt a linear warm-up strategy starting from a learning rate of 0.01. The weight decay is set to 1e-4.
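The quoted learning-rate schedule is simple to reconstruct. Below is a minimal sketch, assuming the warm-up ramps linearly from 0.01 to the base rate and the half-period cosine decays over the epochs remaining after warm-up; the quote fixes the constants (base rate 0.24, 8 warm-up epochs, 100 total epochs) but not these interpolation details, so they are assumptions.

import math

def lr_at_epoch(epoch, base_lr=0.24, warmup_epochs=8,
                warmup_start=0.01, total_epochs=100):
    """Per-epoch learning rate: linear warm-up, then half-period cosine."""
    if epoch < warmup_epochs:
        # Linear ramp from the warm-up starting rate toward the base rate.
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    # Half-period cosine: decays from base_lr toward 0 over remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: inspect the rate at a few points in training.
for e in (0, 4, 8, 50, 99):
    print(e, round(lr_at_epoch(e), 4))

In practice this would be applied per iteration rather than per epoch, and combined with SGD, weight decay 1e-4, and the batch sizes quoted above.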