Memory Consolidation Enables Long-Context Video Understanding
Authors: Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4, Experiments: We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA, and Perception Test. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Ivana Balažević <balazevic@google.com>, Olivier J. Hénaff <henaff@google.com>. |
| Pseudocode | Yes | Algorithm 1: Memory-consolidated ViT; Algorithm 2: Streaming ViT; Algorithm 3: Memory-augmented ViT (a hedged sketch of the memory-consolidation idea appears after the table). |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their code for the described methodology or provide a link to a code repository. |
| Open Datasets | Yes | We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA, and Perception Test. ... Diving48 (Li et al., 2018) ... EgoSchema (Mangalam et al., 2023) ... Next-QA (Xiao et al., 2021) ... Perception Test (Pătrăucean et al., 2023) |
| Dataset Splits | No | The paper refers to fine-tuning and evaluation and mentions a 'test video', but it does not explicitly describe a distinct validation set or the specific splits needed for reproducibility. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or TPU versions) used for running its experiments. |
| Software Dependencies | No | The paper mentions general software components such as a 'BERT-style language encoder' and 'ViViT', and refers to LoRA adaptation, without providing specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Table 5, Training specifications for fine-tuning MC-ViT per dataset: Optimizer: AdamW; Learning-rate schedule: cosine with linear warmup; Gradient clip: 2.0; Linear warmup steps: 1k; Frame-level resolution: 256×256; Batch size: 128 / 256; Label smoothing: 0 / 0.1; # memories/segment (K): 128 / 512; Frame sampling: uniform / 4 FPS; Weight decay rate: 0 / 0 / 1e-2; Base learning rate: 2e-5 / 5e-5 / 1e-6; Training steps: 5k / 30k / 20k (values separated by '/' vary per dataset; see the hedged configuration sketch below). |
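The pseudocode row above refers to the paper's Algorithms 1–3 (memory-consolidated, streaming, and memory-augmented ViT). As a rough illustration of the memory-consolidation idea, the following is a minimal NumPy sketch assuming single-head attention and k-means consolidation of past-segment activations; the names `mc_attention` and `consolidate_kmeans` and the streaming loop are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of memory-consolidated attention over a token stream.
# Assumptions: single-head attention, k-means consolidation, toy shapes.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def consolidate_kmeans(tokens, k, iters=10, seed=0):
    """Compress one segment's activations into k memory vectors
    (k-means centroids); a stand-in for the paper's consolidation step."""
    rng = np.random.default_rng(seed)
    centroids = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def mc_attention(x, memory, w_q, w_k, w_v):
    """Queries come from the current segment only; keys/values come from
    the consolidated memory concatenated with the current tokens."""
    kv = x if memory is None else np.concatenate([memory, x], axis=0)
    q, k, v = x @ w_q, kv @ w_k, kv @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

# Stream over 8 segments of 16 tokens each, keeping K=4 memories/segment,
# so memory grows far more slowly than the raw token count.
d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
memory = None
for segment in np.split(rng.normal(size=(8 * 16, d)), 8):
    out = mc_attention(segment, memory, w_q, w_k, w_v)
    new_mem = consolidate_kmeans(out, k=4)
    memory = new_mem if memory is None else np.concatenate([memory, new_mem])
```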
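For the experiment-setup row, the extracted Table 5 cell collapses several per-dataset columns into one line. The sketch below restates the quoted numbers as a Python structure; the split into shared vs. per-dataset values follows the extracted text, but which dataset (Diving48, EgoSchema, Next-QA, Perception Test) takes which value in each list is not recoverable from the extraction and is deliberately left unmapped.

```python
# Hedged restatement of Table 5 ("Training specifications for fine-tuning
# MC-ViT per dataset"). Lists hold per-dataset alternatives in the order
# they appear in the extracted text; the dataset-to-value mapping is NOT
# specified here, so this is a record of the quote, not a usable config.
FINETUNE_SPEC = {
    "shared": {
        "optimizer": "AdamW",
        "lr_schedule": "cosine with linear warmup",
        "gradient_clip": 2.0,
        "linear_warmup_steps": 1_000,
        "frame_resolution": (256, 256),
    },
    "per_dataset": {
        "batch_size": [128, 256],
        "label_smoothing": [0.0, 0.1],
        "memories_per_segment_K": [128, 512],
        "frame_sampling": ["uniform", "4 FPS"],
        "weight_decay": [0.0, 0.0, 1e-2],
        "base_learning_rate": [2e-5, 5e-5, 1e-6],
        "training_steps": [5_000, 30_000, 20_000],
    },
}
```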