Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Authors: Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, László Jeni
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To analyze RLT's impact on performance and speed, we conduct several experiments on standard action recognition tasks. We measure the speedup on model training at several scales in Section 4.1 as well as RLT's effect as a drop-in addition at inference time in Section 4.2. We perform ablations in Section 4.3, then evaluate RLT's effect on higher FPS videos and long video datasets in Section 4.4. Finally, we provide qualitative visualizations in Section 4.5. |
| Researcher Affiliation | Collaboration | 1 Carnegie Mellon University 2 Fujitsu Research |
| Pseudocode | No | The paper describes the RLT procedure in text and uses figures to illustrate it, but it does not include a formal pseudocode block or algorithm listing (a hedged sketch of the procedure appears after this table). |
| Open Source Code | Yes | Our project page is at https://rccchoudhury.github.io/projects/rlt/. Our code, demos and associated blog post are all located on our project page. We also include our code in this submission and will open-source the code upon releasing the paper. |
| Open Datasets | Yes | We train and evaluate RLT on Kinetics-400 (K400) [19] and Something-Something-v2 (SSv2) [16]. |
| Dataset Splits | No | K400 has 240k training examples and 40k test examples, while SSv2 has 170k training examples and 30k test examples. The paper mentions training for up to 100 epochs with specific hyperparameters, which implies a validation process, but it does not explicitly state the details of a validation split (e.g., percentage, sample count, or method for creating it). |
| Hardware Specification | Yes | All experiments were conducted with 8x H100 Nvidia GPUs with 128 CPU cores, with 16 workers per GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch [32], timm [47], Flash Attention [8, 9], and AVION [54], but it does not specify exact version numbers for these dependencies, which would be needed to reconstruct the software environment exactly. |
| Experiment Setup | Yes | We follow the recommended training recipes from VideoMAE for each model size, namely training for up to 100 epochs, with batch size 256, learning rate with warm-up to 1×10⁻³ for 5 epochs, then cosine annealing down to 1×10⁻⁶. We also use RandAugment, random erasing, CutMix, and standard cropping/scaling and flipping. (A sketch of this schedule appears after the table.) |
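Since the paper provides no algorithm listing, the following is a minimal PyTorch sketch of run-length tokenization as described in the text: patches that barely change from the previous frame are pruned, and each surviving token records how many frames it stands in for. The function name `run_length_tokenize`, the threshold `tau`, and the mean-absolute-difference comparison are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def run_length_tokenize(patches: torch.Tensor, tau: float = 0.1):
    """Prune temporally static patches and record run lengths.

    A sketch only: `patches` is (T, N, D) -- T frames, N patches per
    frame, D values per patch. A patch at frame t > 0 is treated as a
    repeat of its predecessor when their mean absolute difference is
    below the (assumed) threshold `tau`.
    """
    T, N, _ = patches.shape
    keep = torch.ones(T, N, dtype=torch.bool)
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)  # (T-1, N)
    keep[1:] = diff > tau  # frame 0 is always kept in full

    tokens, lengths = [], []
    for n in range(N):  # walk each spatial location through time
        t = 0
        while t < T:
            run = 1
            # absorb the following frames that repeat this patch
            while t + run < T and not keep[t + run, n]:
                run += 1
            tokens.append(patches[t, n])
            lengths.append(run)
            t += run
    return torch.stack(tokens), torch.tensor(lengths)
```

The paper additionally describes encoding each kept token's run length so the model knows its temporal extent; note that the pruning itself needs no learned parameters, which is consistent with the paper's claim that RLT also works as a drop-in change at inference time.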
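Similarly, the warm-up-plus-cosine learning-rate schedule quoted in the Experiment Setup row can be written out as a small helper. The linear warm-up shape and the name `lr_at_epoch` are assumptions; the paper specifies only the peak, the floor, and the epoch counts.

```python
import math

def lr_at_epoch(epoch: int, total_epochs: int = 100, warmup_epochs: int = 5,
                peak_lr: float = 1e-3, min_lr: float = 1e-6) -> float:
    """Quoted recipe: warm up to 1e-3 over 5 epochs, then cosine-anneal
    to 1e-6 by epoch 100. The linear warm-up is an assumed detail."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at_epoch(0)` gives 2e-4, `lr_at_epoch(4)` reaches the 1e-3 peak, and `lr_at_epoch(99)` decays to roughly the 1e-6 floor.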