Space-time Mixing Attention for Video Transformer

Authors: Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

NeurIPS 2021

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the LLM response containing the supporting evidence.
Research Type: Experimental
LLM Response: "We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models." (see also Section 4.1, Experimental setup)
Researcher Affiliation: Collaboration
LLM Response: Adrian Bulat (Samsung AI Cambridge, adrian@adrianbulat.com); Juan-Manuel Perez-Rua (Samsung AI Cambridge, j.perez-rua@samsung.com); Swathikiran Sudhakaran (Samsung AI Cambridge, swathikir.s@samsung.com); Brais Martinez (Samsung AI Cambridge, brais.a@samsung.com); Georgios Tzimiropoulos (Samsung AI Cambridge and Queen Mary University of London, g.tzimiropoulos@qmul.ac.uk)
Pseudocode: No
LLM Response: The paper describes the methods in narrative text and uses figures to illustrate concepts, but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No
LLM Response: "Code for our method is made available here." and "We will release code and models to facilitate this." In the checklist: "[No] We include however all implementation details required to reproduce our work. We will make the code and the models available."
Open Datasets: Yes
LLM Response: "Datasets: We train and evaluate the proposed models on the following datasets (all datasets are publicly available for research purposes): Kinetics-400 and 600: The Kinetics [21] dataset... Something-Something-v2 (SSv2): The SSv2 [17] dataset... Epic Kitchens-100 (Epic-100): is an egocentric large scale action recognition dataset..."
Dataset Splits: No
LLM Response: The paper mentions using well-known datasets like Kinetics, Something-Something-v2, and Epic Kitchens-100, which often have standard splits. However, it does not explicitly state the train/validation/test split percentages or sample counts within the paper, nor does it cite a specific paper for the exact splits used for these datasets.
Hardware Specification: Yes
LLM Response: "The models were trained on 8 V100 GPUs using PyTorch [30]."
Software Dependencies: No
LLM Response: The paper mentions "PyTorch [30]" but does not specify a version number for it or any other software dependency.
Experiment Setup: Yes
LLM Response: "Specifically, our models were trained using SGD with momentum (0.9) and a cosine scheduler [28] (with linear warmup) for 35 epochs on SSv2, 50 on Epic-100 and 30 on Kinetics. The base learning rate, set at a batch size of 128, was 0.05 (0.03 for Kinetics). To prevent over-fitting we made use of the following augmentation techniques: random scaling (0.9 to 1.3) and cropping, random flipping (with probability of 0.5; not for SSv2) and autoaugment [8]. In addition, for SSv2 and Epic-100, we also applied random erasing (probability = 0.5, min. area = 0.02, max. area = 1/3, min. aspect = 0.3) [52] and label smoothing (λ = 0.3) [34] while, for Kinetics, we used mixup [51] (α = 0.4)."
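
To make the quoted recipe concrete, below is a minimal PyTorch sketch of the SSv2 training configuration. It is an illustration under stated assumptions, not the authors' released code: the warmup length, input resolution, and the stand-in model are hypothetical (the paper does not quote them above), and "autoaugment" is approximated with torchvision's ImageNet AutoAugment policy.

```python
import math
import random

import torch
from torch import nn, optim
from torchvision import transforms
from torchvision.transforms import functional as F

# Quoted hyper-parameters (SSv2 settings; comments note the other datasets).
EPOCHS = 35          # 50 for Epic-100, 30 for Kinetics
BATCH_SIZE = 128     # the base learning rate is defined at this batch size
BASE_LR = 0.05       # 0.03 for Kinetics
MOMENTUM = 0.9
WARMUP_EPOCHS = 5    # assumption: the paper says only "linear warmup"
CROP_SIZE = 224      # assumption: input resolution is not quoted above

class RandomShortSideScale:
    """Random scaling (0.9x to 1.3x of the crop size) applied to the short side."""
    def __call__(self, img):
        size = int(CROP_SIZE * random.uniform(0.9, 1.3))
        return F.resize(img, size)

# Per-frame augmentations quoted for SSv2: random scaling and cropping,
# autoaugment, and random erasing. Horizontal flipping is skipped on SSv2.
train_transform = transforms.Compose([
    RandomShortSideScale(),
    transforms.RandomCrop(CROP_SIZE, pad_if_needed=True),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 1 / 3), ratio=(0.3, 1 / 0.3)),
])

model = nn.Linear(768, 174)  # hypothetical stand-in for the video transformer
criterion = nn.CrossEntropyLoss(label_smoothing=0.3)  # λ = 0.3 on SSv2/Epic-100
optimizer = optim.SGD(model.parameters(), lr=BASE_LR, momentum=MOMENTUM)

def warmup_cosine(epoch: int) -> float:
    """Linear warmup followed by cosine decay, per the quoted schedule."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)

for epoch in range(EPOCHS):
    # ... one pass over the SSv2 training loader goes here; for Kinetics,
    # mixup (α = 0.4) would mix inputs and targets inside this loop ...
    scheduler.step()
```

The Kinetics variant follows by setting EPOCHS = 30, BASE_LR = 0.03, re-enabling random horizontal flipping, and replacing label smoothing and random erasing with mixup, as the quoted setup describes.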