Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Authors: Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, László Jeni
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To analyze RLT's impact on performance and speed, we conduct several experiments on standard action recognition tasks. We measure the speedup on model training at several scales in Section 4.1 as well as RLT's effect as a drop-in addition at inference time in Section 4.2. We perform ablations in Section 4.3, then evaluate RLT's effect on higher FPS videos and long video datasets in Section 4.4. Finally, we provide qualitative visualizations in Section 4.5. |
| Researcher Affiliation | Collaboration | 1 Carnegie Mellon University 2 Fujitsu Research |
| Pseudocode | No | The paper describes the RLT procedure in text and uses figures to illustrate it, but it does not include a formal pseudocode block or algorithm listing (a hedged sketch of the procedure appears after this table). |
| Open Source Code | Yes | Our project page is at https://rccchoudhury.github.io/projects/rlt/. Our code, demos and associated blog post are all located on our project page. We also include our code in this submission and will open-source the code upon releasing the paper. |
| Open Datasets | Yes | We train and evaluate RLT on Kinetics-400 (K400) [19] and Something-Something-v2 (SSv2) [16]. |
| Dataset Splits | No | K400 has 240k training examples and 40k test examples, while SSv2 has 170k training examples and 30k test examples. The paper mentions training for up to 100 epochs with specific hyperparameters, which implies a validation process, but it does not explicitly state the details of a validation split (e.g., percentage, sample count, or method for creating it). |
| Hardware Specification | Yes | All experiments were conducted with 8x H100 Nvidia GPUs with 128 CPU cores, with 16 workers per GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch [32], timm [47], Flash Attention [8, 9], and AVION [54], but it does not specify exact version numbers for these dependencies, which would be needed to reconstruct the software environment exactly. |
| Experiment Setup | Yes | We follow the recommended training recipes from VideoMAE for each model size, namely training for up to 100 epochs, with batch size 256, learning rate with warm-up to 1×10⁻³ for 5 epochs, then cosine annealing down to 1×10⁻⁶. We also use RandAugment, random erasing, CutMix, and standard cropping/scaling and flipping. (A sketch of this schedule appears after the table.) |
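Since the paper provides no algorithm listing, the following is a minimal PyTorch sketch of run-length tokenization as described in the text: patches that barely change from the previous frame are pruned, and each surviving token records how many frames it stands in for. The function name `run_length_tokenize`, the threshold `tau`, and the mean-absolute-difference comparison are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def run_length_tokenize(patches: torch.Tensor, tau: float = 0.1):
    """Prune temporally static patches and record run lengths.

    A sketch only: `patches` is (T, N, D) -- T frames, N patches per
    frame, D values per patch. A patch at frame t > 0 is treated as a
    repeat of its predecessor when their mean absolute difference is
    below the (assumed) threshold `tau`.
    """
    T, N, _ = patches.shape
    keep = torch.ones(T, N, dtype=torch.bool)
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)  # (T-1, N)
    keep[1:] = diff > tau  # frame 0 is always kept in full

    tokens, lengths = [], []
    for n in range(N):  # walk each spatial location through time
        t = 0
        while t < T:
            run = 1
            # absorb the following frames that repeat this patch
            while t + run < T and not keep[t + run, n]:
                run += 1
            tokens.append(patches[t, n])
            lengths.append(run)
            t += run
    return torch.stack(tokens), torch.tensor(lengths)
```

The paper additionally describes encoding each kept token's run length so the model knows its temporal extent; note that the pruning itself needs no learned parameters, which is consistent with the paper's claim that RLT also works as a drop-in change at inference time.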
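Similarly, the warm-up-plus-cosine learning-rate schedule quoted in the Experiment Setup row can be written out as a small helper. The linear warm-up shape and the name `lr_at_epoch` are assumptions; the paper specifies only the peak, the floor, and the epoch counts.

```python
import math

def lr_at_epoch(epoch: int, total_epochs: int = 100, warmup_epochs: int = 5,
                peak_lr: float = 1e-3, min_lr: float = 1e-6) -> float:
    """Quoted recipe: warm up to 1e-3 over 5 epochs, then cosine-anneal
    to 1e-6 by epoch 100. The linear warm-up is an assumed detail."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at_epoch(0)` gives 2e-4, `lr_at_epoch(4)` reaches the 1e-3 peak, and `lr_at_epoch(99)` decays to roughly the 1e-6 floor.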