Temporal-attentive Covariance Pooling Networks for Video Recognition
Authors: Zilin Gao, Qilong Wang, Bingbing Zhang, Qinghua Hu, Peihua Li
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show our TCPNet is clearly superior to its counterparts, while having strong generalization ability. The source code is publicly available. ... To verify its effectiveness, extensive experiments are conducted on six video benchmarks (i.e., Mini-Kinetics-200 [66], Kinetics-400 [2], Something-Something V1 [19], Charades [46], UCF101 [49] and HMDB51 [23]) using various deep architectures (e.g., TSN [59], X3D [10] and TEA [33]). |
| Researcher Affiliation | Academia | School of Information and Communication Engineering, Dalian University of Technology; College of Intelligence and Computing, Tianjin University. gzl@mail.dlut.edu.cn, qlwang@tju.edu.cn, icyzhang@mail.dlut.edu.cn, huqinghua@tju.edu.cn, peihuali@dlut.edu.cn |
| Pseudocode | Yes | Specifically, for each covariance representation $P_{TCP}$ output by our TCP, we compute its approximate matrix square root as follows: Iteration: $\{Q_k = \frac{1}{2} Q_{k-1}(3I - R_{k-1} Q_{k-1});\ R_k = \frac{1}{2}(3I - R_{k-1} Q_{k-1}) R_{k-1}\}_{k=1,\dots,K}$ (Eq. 11). A code sketch of this iteration is given after the table. |
| Open Source Code | Yes | The source code is publicly available. |
| Open Datasets | Yes | The extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show our TCPNet is clearly superior to its counterparts, while having strong generalization ability. The source code is publicly available. ... To verify its effectiveness, extensive experiments are conducted on six video benchmarks (i.e., Mini-Kinetics-200 [66], Kinetics-400 [2], Something-Something V1 [19], Charades [46], UCF101 [49] and HMDB51 [23]) |
| Dataset Splits | No | The paper mentions using several standard video benchmarks (e.g., Kinetics-400, Something-Something V1, UCF101) and refers to training settings from other papers, implying the use of their standard splits, but does not explicitly state the training/validation/test split percentages or sample counts within its own text. |
| Hardware Specification | Yes | All programs are implemented by Pytorch and run on a PC equipped with four NVIDIA Titan RTX GPUs. |
| Software Dependencies | No | The paper states 'All programs are implemented by Pytorch' but does not specify a version number for PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | Here we describe the settings of hyper-parameters on Mini-K200 and K-400. For training our TCPNet with 2D CNNs, we adopt the same data augmentation strategy as [59], and the number of segments is set to 8 or 16. A dropout with a rate of 0.5 is used for the last FC layer. TCPNet is optimized by mini-batch stochastic gradient descent (SGD) with a batch size of 96, a momentum of 0.9 and a weight decay of 1e-4. The whole network is trained for 50 epochs, with an initial learning rate of 0.015 decayed by 0.1 every 20 epochs. For training our TCPNet with X3D-M, we process the images following [10], and 16 frames are sampled as inputs. SGD with a cosine training strategy is used to optimize the network parameters over 100 epochs, and the initial learning rate is set to 0.1. (These settings map onto the optimizer sketch after the table.) |
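
To make the quoted Eq. (11) concrete, here is a minimal PyTorch sketch of the coupled Newton-Schulz iteration for the approximate matrix square root. This is not the authors' released code: the function name `newton_schulz_sqrt`, the default of 5 iterations, and the trace-based pre-/post-compensation (a standard step that keeps the iteration convergent for SPD inputs) are assumptions of this sketch.

```python
import torch

def newton_schulz_sqrt(P: torch.Tensor, num_iters: int = 5, eps: float = 1e-8) -> torch.Tensor:
    """Approximate square root of a batch of SPD matrices P (shape B x d x d)
    via the coupled iteration of Eq. (11):
        Q_k = 1/2 * Q_{k-1} (3I - R_{k-1} Q_{k-1})
        R_k = 1/2 * (3I - R_{k-1} Q_{k-1}) R_{k-1}
    with Q_0 = P / tr(P) and R_0 = I (normalization assumed, not quoted)."""
    B, d, _ = P.shape
    I = torch.eye(d, dtype=P.dtype, device=P.device).expand(B, d, d)
    # Pre-normalize by the trace so the iteration converges.
    trace = P.diagonal(dim1=-2, dim2=-1).sum(-1).clamp(min=eps).view(B, 1, 1)
    Q, R = P / trace, I.clone()
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - R.bmm(Q))  # shared factor 1/2 (3I - R_{k-1} Q_{k-1})
        Q, R = Q.bmm(T), T.bmm(R)
    # Undo the normalization: Q now approximates P^{1/2}.
    return Q * trace.sqrt()
```

As a quick check, for `P = A.bmm(A.transpose(1, 2)) + 1e-3 * torch.eye(d)` with random `A`, the product `S.bmm(S)` of the returned `S` should be close to `P` after a few iterations.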
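
The quoted training settings also translate directly into standard PyTorch optimizer and scheduler objects. Below is a minimal sketch, assuming a generic `model`; the paper does not report momentum or weight decay for the X3D-M variant, so those values are carried over from the 2D-CNN setting as an assumption.

```python
import torch
import torch.nn as nn

def tsn_style_optimizer(model: nn.Module):
    # 2D-CNN setting from the paper: SGD with momentum 0.9 and weight decay
    # 1e-4 (batch size 96 is handled by the data loader); LR starts at 0.015
    # and is decayed by 0.1 every 20 epochs over 50 epochs in total.
    opt = torch.optim.SGD(model.parameters(), lr=0.015,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.1)
    return opt, sched

def x3d_style_optimizer(model: nn.Module):
    # X3D-M setting: cosine schedule over 100 epochs, initial LR 0.1.
    # Momentum and weight decay are not reported for this variant; the
    # values below are assumed to match the 2D-CNN setting.
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
    return opt, sched
```

In a training loop, `sched.step()` would be called once per epoch in both cases.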