TokenLearner: Adaptive Space-Time Tokenization for Videos

Authors: Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate strong performance on several challenging benchmarks for video recognition tasks. We establish new state-of-the-arts on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD."
Researcher Affiliation | Collaboration | Michael S. Ryoo (Google Research, Stony Brook University), AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova (Google Research); {mryoo,ajpiergi,aarnab,dehghani,anelia}@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code will be available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner"
Open Datasets | Yes | "We use the Kinetics datasets... We train and evaluate on both Kinetics-400 and Kinetics-600 datasets... We follow the standard settings used in previous papers and report accuracy on the validation set [5, 12]."; "Charades dataset [31]"; "AViD dataset [27]"
Dataset Splits | Yes | "We train and evaluate on both Kinetics-400 and Kinetics-600 datasets... We follow the standard settings used in previous papers and report accuracy on the validation set [5, 12]."
Hardware Specification | No | The paper reports FLOPs/GFLOPS for computational cost but does not specify the CPU, GPU, or other hardware used to run the experiments.
Software Dependencies | No | The paper mentions using the Scenic library (built on JAX) but does not give version numbers for JAX or other critical software dependencies.
Experiment Setup | Yes | "Following the setting of [2], we used the input resolution of 224×224, extracting tubelets, and attaching positional encodings. We tried various numbers of tokens including S = 8, 16, 32, and use S = 8 and 16 as our default settings."; "We use 224×224×64 videos for training and 256×256×64 videos for testing."; "S = 8 number of tokens were used."
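The core idea behind the token counts quoted above (S = 8, 16, 32) is that TokenLearner replaces a dense grid of spatial features with a small set of S adaptively pooled tokens. Below is a minimal NumPy sketch of that reduction, not the paper's implementation: the paper learns its attention maps with a small convolutional function, whereas here a single linear map followed by a spatial softmax stands in, and all shapes and weights are illustrative.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_learner(x, w):
    """Reduce HW spatial feature vectors to S learned tokens.

    x: (HW, C) flattened per-frame features.
    w: (C, S) weights producing one spatial attention map per token
       (a stand-in for the paper's learned conv attention function).
    Returns: (S, C) tokens, each an attention-weighted spatial pooling of x.
    """
    attn = softmax(x @ w, axis=0)  # (HW, S); each column sums to 1 over space
    return attn.T @ x              # (S, C)

# Example: a 14x14 feature map with C=8 channels reduced to S=8 tokens,
# matching the paper's default token count.
rng = np.random.default_rng(0)
x = rng.normal(size=(14 * 14, 8))
w = rng.normal(size=(8, 8))
tokens = token_learner(x, w)
print(tokens.shape)  # (8, 8)
```

Applied per frame, this turns a 14×14 = 196-position grid into 8 tokens, which is where the paper's reported FLOP savings for the subsequent transformer layers come from.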