UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning
Authors: Kunchang Li, Yali Wang, Gao Peng, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | 1Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences; 3Shanghai AI Laboratory, Shanghai, China; 4SenseTime Research; 5The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes the model architecture and mathematical formulations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/Sense-X/UniFormer. |
| Open Datasets | Yes | We conduct experiments on widely-used Kinetics-400 (Carreira & Zisserman, 2017a) and larger benchmark Kinetics-600 (Carreira et al., 2018). We further verify the transfer learning performance on temporal-related datasets Something-Something V1&V2 (Goyal et al., 2017b). |
| Dataset Splits | Yes | We conduct experiments on widely-used Kinetics-400 (Carreira & Zisserman, 2017a) and the larger benchmark Kinetics-600 (Carreira et al., 2018). We further verify the transfer learning performance on the temporal-related datasets Something-Something V1&V2 (Goyal et al., 2017b). We evaluate our network with different numbers of clips and crops for the validation videos. As shown in Figure 4, since Kinetics is a scene-related dataset trained with dense sampling, multi-clip testing is preferable because it covers more frames. In contrast, Something-Something is a temporal-related dataset trained with uniform sampling, so multi-crop testing is better for capturing the discriminative motion (see the test-time aggregation sketch after the table). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers, normalization techniques, and training settings by referencing other works, but does not list specific software dependencies with version numbers (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | For UniFormer-S, the warmup epochs, total epochs, stochastic depth rate, and weight decay are set to 10, 110, 0.1, and 0.05 respectively for Kinetics, and to 5, 50, 0.3, and 0.05 respectively for Something-Something. For UniFormer-B, all hyper-parameters are the same except that the stochastic depth rates are doubled. We linearly scale the base learning rates according to the batch size: 1e-4 × batchsize/32 for Kinetics and 2e-4 × batchsize/32 for Something-Something (see the configuration sketch after the table). |
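
The Experiment Setup row quotes the training schedule and the linear learning-rate scaling rule. The following is a minimal Python sketch of that setup; the dictionary layout and the `scaled_lr` helper are illustrative assumptions, not code from the UniFormer repository.

```python
def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Linear learning-rate scaling rule quoted above: base_lr * batch_size / 32."""
    return base_lr * batch_size / 32


# Hyper-parameters reported for UniFormer-S; UniFormer-B doubles the
# stochastic depth rates and keeps everything else the same.
TRAIN_CONFIG = {
    "kinetics": {
        "warmup_epochs": 10,
        "total_epochs": 110,
        "stochastic_depth": 0.1,
        "weight_decay": 0.05,
        "base_lr": 1e-4,  # scaled as 1e-4 * batch_size / 32
    },
    "something_something": {
        "warmup_epochs": 5,
        "total_epochs": 50,
        "stochastic_depth": 0.3,
        "weight_decay": 0.05,
        "base_lr": 2e-4,  # scaled as 2e-4 * batch_size / 32
    },
}

if __name__ == "__main__":
    # Example: effective learning rate for a batch size of 256 on Kinetics.
    cfg = TRAIN_CONFIG["kinetics"]
    print(scaled_lr(cfg["base_lr"], batch_size=256))  # 8e-4
```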
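
The Dataset Splits row describes multi-clip testing for Kinetics and multi-crop testing for Something-Something. Below is a minimal sketch of the standard view-averaging protocol such evaluation typically uses, assuming PyTorch is available; the function name and tensor shapes are illustrative and not taken from the UniFormer codebase.

```python
import torch


def evaluate_video(model: torch.nn.Module, views: torch.Tensor) -> torch.Tensor:
    """Average softmax scores over all spatial-temporal views of one video.

    views: tensor of shape (num_clips * num_crops, C, T, H, W), e.g. several
    clips with one crop each for dense sampling (Kinetics-style testing), or
    one clip with several crops for uniform sampling (Something-Something-style).
    """
    with torch.no_grad():
        logits = model(views)                   # (num_views, num_classes)
        scores = torch.softmax(logits, dim=-1)  # per-view class probabilities
    return scores.mean(dim=0)                   # fused video-level prediction
```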