Temporal-attentive Covariance Pooling Networks for Video Recognition

Authors: Zilin Gao, Qilong Wang, Bingbing Zhang, Qinghua Hu, Peihua Li

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show our TCPNet is clearly superior to its counterparts, while having strong generalization ability. The source code is publicly available." ... "To verify its effectiveness, extensive experiments are conducted on six video benchmarks (i.e., Mini-Kinetics-200 [66], Kinetics-400 [2], Something-Something V1 [19], Charades [46], UCF101 [49] and HMDB51 [23]) using various deep architectures (e.g., TSN [59], X3D [10] and TEA [33])."
Researcher Affiliation | Academia | "School of Information and Communication Engineering, Dalian University of Technology; College of Intelligence and Computing, Tianjin University; gzl@mail.dlut.edu.cn, qlwang@tju.edu.cn, icyzhang@mail.dlut.edu.cn, huqinghua@tju.edu.cn, peihuali@dlut.edu.cn"
Pseudocode | Yes | "Specifically, for each covariance representation P_TCP output by our TCP, we compute its approximate matrix square root as follows: Iteration: {Q_k = (1/2) Q_{k-1}(3I - R_{k-1}Q_{k-1}); R_k = (1/2)(3I - R_{k-1}Q_{k-1}) R_{k-1}}, k = 1, ..., K. (11)"
Open Source Code | Yes | "The source code is publicly available."
Open Datasets | Yes | "The extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show our TCPNet is clearly superior to its counterparts, while having strong generalization ability. The source code is publicly available." ... "To verify its effectiveness, extensive experiments are conducted on six video benchmarks (i.e., Mini-Kinetics-200 [66], Kinetics-400 [2], Something-Something V1 [19], Charades [46], UCF101 [49] and HMDB51 [23])."
Dataset Splits | No | The paper mentions using several standard video benchmarks (e.g., Kinetics-400, Something-Something V1, UCF101) and refers to training settings from other papers, implying the use of their standard splits, but does not explicitly state the training/validation/test split percentages or sample counts in its own text.
Hardware Specification | Yes | "All programs are implemented by Pytorch and run on a PC equipped with four NVIDIA Titan RTX GPUs."
Software Dependencies | No | The paper states "All programs are implemented by Pytorch" but does not give a version number for PyTorch or for any other software dependency.
Experiment Setup | Yes | "Here we describe the settings of hyper-parameters on Mini-K200 and K-400. For training our TCPNet with 2D CNNs, we adopt the same data augmentation strategy as [59], and the number of segments is set to 8 or 16. A dropout with a rate of 0.5 is used for the last FC layer. TCPNet is optimized by mini-batch stochastic gradient descent (SGD) with a batch size of 96, a momentum of 0.9 and a weight decay of 1e-4. The whole networks are trained within 50 epochs, where the initial learning rate is 0.015 and decays by 0.1 every 20 epochs. For training our TCPNet with X3D-M, we process the images following [10], and 16 frames are sampled as inputs. SGD with a cosine training strategy is used to optimize the network parameters within 100 epochs, and the initial learning rate is set to 0.1."
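The coupled iteration quoted in the Pseudocode row (Eq. 11) can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code: the function name `newton_schulz_sqrt` is hypothetical, and the trace pre-normalization is a standard convergence safeguard for Newton-Schulz iterations that is assumed here, since the quoted excerpt shows only the iteration itself.

```python
import numpy as np

def newton_schulz_sqrt(P, num_iters=5):
    """Approximate matrix square root of an SPD matrix P via the
    coupled Newton-Schulz iteration (Eq. 11):
      Q_k = 1/2 * Q_{k-1} (3I - R_{k-1} Q_{k-1})
      R_k = 1/2 * (3I - R_{k-1} Q_{k-1}) R_{k-1}
    Name and pre-normalization are assumptions, not the paper's code."""
    d = P.shape[0]
    I = np.eye(d)
    norm = np.trace(P)        # pre-normalize so the iteration converges
    Q = P / norm              # Q_0 = P / tr(P)
    R = I.copy()              # R_0 = I
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - R @ Q)
        Q, R = Q @ T, T @ R   # Q_k -> (P/norm)^{1/2}, R_k -> (P/norm)^{-1/2}
    return Q * np.sqrt(norm)  # undo the normalization
```

With a well-conditioned covariance matrix, a handful of iterations already gives `Q @ Q` close to `P`, which is why the paper can use a small fixed `K` at training time.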
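The 2D-CNN training recipe in the Experiment Setup row maps directly onto standard PyTorch optimizer and scheduler objects. The sketch below is a minimal illustration under stated assumptions: `model` is a placeholder linear layer standing in for TCPNet, and the per-epoch training work is elided.

```python
import torch

# Placeholder model standing in for TCPNet (hypothetical, for illustration only).
model = torch.nn.Linear(512, 400)

# SGD settings quoted from the paper: lr 0.015, momentum 0.9, weight decay 1e-4.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.015,
    momentum=0.9,
    weight_decay=1e-4,
)
# "decays by 0.1 every 20 epochs" over 50 epochs total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(50):
    optimizer.step()   # stands in for one full training epoch (batch size 96)
    scheduler.step()
```

For the X3D-M variant, the paper instead uses `CosineAnnealingLR`-style decay from an initial rate of 0.1 over 100 epochs; only the step-decay schedule is shown here.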