Memory Consolidation Enables Long-Context Video Understanding
Authors: Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4, Experiments: We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA, and Perception Test. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Ivana Balažević <balazevic@google.com>, Olivier J. Hénaff <henaff@google.com>. |
| Pseudocode | Yes | Algorithm 1: Memory-consolidated ViT; Algorithm 2: Streaming ViT; Algorithm 3: Memory-augmented ViT (a hedged sketch of the memory-consolidation idea appears after the table). |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their code for the described methodology or provide a link to a code repository. |
| Open Datasets | Yes | We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA, and Perception Test. ... Diving48 (Li et al., 2018) ... EgoSchema (Mangalam et al., 2023) ... Next-QA (Xiao et al., 2021) ... Perception Test (Pătrăucean et al., 2023) |
| Dataset Splits | No | The paper refers to fine-tuning and evaluation and mentions a 'test video', but it does not explicitly describe a distinct validation set or the specific splits needed for reproducibility. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or TPU versions) used for running its experiments. |
| Software Dependencies | No | The paper mentions general software components such as a 'BERT-style language encoder' and 'ViViT', and refers to LoRA adaptation, without providing specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Table 5, Training specifications for fine-tuning MC-ViT per dataset: Optimizer: AdamW; Learning-rate schedule: cosine with linear warmup; Gradient clip: 2.0; Linear warmup steps: 1k; Frame-level resolution: 256×256; Batch size: 128 / 256; Label smoothing: 0 / 0.1; # memories/segment (K): 128 / 512; Frame sampling: uniform / 4 FPS; Weight decay rate: 0 / 0 / 1e-2; Base learning rate: 2e-5 / 5e-5 / 1e-6; Training steps: 5k / 30k / 20k (values separated by '/' vary per dataset; see the hedged configuration sketch below). |
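The pseudocode row above refers to the paper's Algorithms 1–3 (memory-consolidated, streaming, and memory-augmented ViT). As a rough illustration of the memory-consolidation idea, the following is a minimal NumPy sketch assuming single-head attention and k-means consolidation of past-segment activations; the names `mc_attention` and `consolidate_kmeans` and the streaming loop are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of memory-consolidated attention over a token stream.
# Assumptions: single-head attention, k-means consolidation, toy shapes.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def consolidate_kmeans(tokens, k, iters=10, seed=0):
    """Compress one segment's activations into k memory vectors
    (k-means centroids); a stand-in for the paper's consolidation step."""
    rng = np.random.default_rng(seed)
    centroids = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def mc_attention(x, memory, w_q, w_k, w_v):
    """Queries come from the current segment only; keys/values come from
    the consolidated memory concatenated with the current tokens."""
    kv = x if memory is None else np.concatenate([memory, x], axis=0)
    q, k, v = x @ w_q, kv @ w_k, kv @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

# Stream over 8 segments of 16 tokens each, keeping K=4 memories/segment,
# so memory grows far more slowly than the raw token count.
d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
memory = None
for segment in np.split(rng.normal(size=(8 * 16, d)), 8):
    out = mc_attention(segment, memory, w_q, w_k, w_v)
    new_mem = consolidate_kmeans(out, k=4)
    memory = new_mem if memory is None else np.concatenate([memory, new_mem])
```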
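For the experiment-setup row, the extracted Table 5 cell collapses several per-dataset columns into one line. The sketch below restates the quoted numbers as a Python structure; the split into shared vs. per-dataset values follows the extracted text, but which dataset (Diving48, EgoSchema, Next-QA, Perception Test) takes which value in each list is not recoverable from the extraction and is deliberately left unmapped.

```python
# Hedged restatement of Table 5 ("Training specifications for fine-tuning
# MC-ViT per dataset"). Lists hold per-dataset alternatives in the order
# they appear in the extracted text; the dataset-to-value mapping is NOT
# specified here, so this is a record of the quote, not a usable config.
FINETUNE_SPEC = {
    "shared": {
        "optimizer": "AdamW",
        "lr_schedule": "cosine with linear warmup",
        "gradient_clip": 2.0,
        "linear_warmup_steps": 1_000,
        "frame_resolution": (256, 256),
    },
    "per_dataset": {
        "batch_size": [128, 256],
        "label_smoothing": [0.0, 0.1],
        "memories_per_segment_K": [128, 512],
        "frame_sampling": ["uniform", "4 FPS"],
        "weight_decay": [0.0, 0.0, 1e-2],
        "base_learning_rate": [2e-5, 5e-5, 1e-6],
        "training_steps": [5_000, 30_000, 20_000],
    },
}
```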