TokenLearner: Adaptive Space-Time Tokenization for Videos
Authors: Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate strong performance on several challenging benchmarks for video recognition tasks. We establish new state-of-the-arts on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD. |
| Researcher Affiliation | Collaboration | Michael S. Ryoo1,2, AJ Piergiovanni1, Anurag Arnab1, Mostafa Dehghani1, Anelia Angelova1 1Google Research 2Stony Brook University {mryoo,ajpiergi,aarnab,dehghani,anelia}@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner |
| Open Datasets | Yes | We use the Kinetics datasets... We train and evaluate on both Kinetics-400 and Kinetics-600 datasets... We follow the standard settings used in previous papers and report accuracy on the validation set [5, 12]; Charades dataset [31]; AViD dataset [27] |
| Dataset Splits | Yes | We train and evaluate on both Kinetics-400 and Kinetics-600 datasets... We follow the standard settings used in previous papers and report accuracy on the validation set [5, 12]. |
| Hardware Specification | No | The paper mentions FLOPs and GFLOPS for computational cost but does not specify the CPU, GPU, or other hardware used for running experiments. |
| Software Dependencies | No | The paper mentions using the Scenic library (built on JAX) but does not provide specific version numbers for JAX or other critical software dependencies. |
| Experiment Setup | Yes | Following the setting of [2], we used the input resolution of 224x224, extracting tubelets, and attaching positional encodings. We tried various numbers of tokens including S = 8, 16, 32, and use S = 8 and 16 as our default settings. We use 224x224x64 videos for training and 256x256x64 videos for testing. S = 8 tokens were used. |
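For context on the experiment-setup row above: TokenLearner reduces a frame's spatial grid to a small set of S learned tokens (S = 8 or 16 in the paper) by computing S spatial attention maps and pooling the features under each map. The sketch below is a minimal NumPy illustration of that idea, not the paper's Scenic/JAX implementation; the function name `token_learner`, the single linear map `w`, and the softmax normalization are simplifying assumptions for illustration.

```python
import numpy as np

def token_learner(x, w):
    """Minimal TokenLearner-style pooling (illustrative sketch).

    x: (N, C) array of N flattened spatial features with C channels.
    w: (C, S) learned weights producing S spatial attention maps.
    Returns: (S, C) array of S adaptive tokens.
    """
    logits = x @ w                                # (N, S) attention logits
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)       # softmax over spatial positions
    return attn.T @ x                             # attention-weighted spatial pooling

rng = np.random.default_rng(0)
feats = rng.standard_normal((14 * 14, 64))        # e.g. a 14x14 feature grid, 64 channels
weights = rng.standard_normal((64, 8))            # S = 8 tokens, as in the paper's default
tokens = token_learner(feats, weights)            # -> shape (8, 64)
```

Whatever the grid size N, the Transformer layers that follow only process S tokens, which is where the paper's FLOP savings come from.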