Memory Consolidation Enables Long-Context Video Understanding

Authors: Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Experiments): "We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA and Perception Test."
Researcher Affiliation | Industry | "1Google DeepMind. Correspondence to: Ivana Balažević <balazevic@google.com>, Olivier J. Hénaff <henaff@google.com>."
Pseudocode | Yes | Algorithm 1 (Memory-consolidated ViT), Algorithm 2 (Streaming ViT), Algorithm 3 (Memory-augmented ViT).
Open Source Code | No | The paper does not state that the authors release code for the described method, and it provides no link to a code repository.
Open Datasets | Yes | "We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA and Perception Test." Datasets: Diving48 (Li et al., 2018), EgoSchema (Mangalam et al., 2023), Next-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023).
Dataset Splits | No | The paper refers to fine-tuning and evaluation and mentions a "test video", but it does not specify a distinct validation set or the exact splits needed for reproducibility.
Hardware Specification | No | The paper does not describe the hardware (e.g., GPU, CPU, or TPU models) used to run its experiments.
Software Dependencies | No | The paper mentions components such as a BERT-style language encoder, ViViT, and LoRA adaptation, but gives no version numbers for these software dependencies.
Experiment Setup | Yes | Table 5 (training specifications for fine-tuning MC-ViT; multiple values are per-dataset columns):
  Optimizer: AdamW
  Learning rate schedule: cosine with linear warmup
  Gradient clip: 2.0
  Linear warmup steps: 1k
  Frame-level resolution: 256 / 256
  Batch size: 128 / 256
  Label smoothing: 0 / 0.1
  # memories/segment (K): 128 / 512
  Frame sampling: uniform / 4 FPS
  Weight decay rate: 0 / 0 / 1e-2
  Base learning rate: 2e-5 / 5e-5 / 1e-6
  Training steps: 5k / 30k / 20k
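The paper's Algorithms 1-3 are not reproduced in this report. As a rough illustration only of the general idea behind memory consolidation (compressing past activations into a fixed budget of K memory vectors that later segments can attend to), here is a hypothetical NumPy sketch using plain k-means; the function name, the k-means choice, and all parameters are assumptions, not the authors' exact algorithm.

```python
import numpy as np

def consolidate(past_tokens, K, iters=10, seed=0):
    """Compress past activations [N, D] into K memory vectors.

    Uses plain k-means as the consolidation rule; this is one
    hypothetical choice for illustration, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    # initialize the K memories from randomly chosen past tokens
    centers = past_tokens[rng.choice(len(past_tokens), K, replace=False)]
    for _ in range(iters):
        # assign each past token to its nearest memory vector
        dists = ((past_tokens[:, None] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # move each memory to the mean of its assigned tokens
        for k in range(K):
            members = past_tokens[assign == k]
            if len(members):
                centers[k] = members.mean(0)
    return centers  # [K, D] consolidated memories
```

A streaming model would then cross-attend from the current segment's tokens to these K memories instead of to all past tokens, keeping the context cost bounded.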
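The optimizer settings quoted from Table 5 (AdamW, cosine learning-rate schedule with 1k linear warmup steps) can be sketched as a plain-Python schedule function. The 2e-5 base rate and 5k total steps below are taken from one of the per-dataset columns; the function itself is an illustration of "cosine with linear warmup", not the authors' training code.

```python
import math

def lr_schedule(step, base_lr=2e-5, warmup_steps=1_000, total_steps=5_000):
    """Cosine decay with linear warmup, as specified in Table 5."""
    if step < warmup_steps:
        # linear ramp from 0 to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is 0 at step 0, peaks at the base rate when warmup ends, and decays to 0 by the final step.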