Look, Remember and Reason: Grounded Reasoning in Videos with Language Models

Authors: Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, Roland Memisevic

Venue: ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across tasks by a large margin.
Researcher Affiliation | Industry | Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic (Qualcomm AI Research)
Pseudocode | No | The paper includes architectural diagrams and mathematical equations, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper neither states explicitly that the authors' code is released nor links to a code repository for their method.
Open Datasets | Yes | We evaluate on visual reasoning tasks from: ACRE (Zhang et al., 2021), Something-Else (Materzynska et al., 2020), CATER (Girdhar & Ramanan, 2020) and STAR (Wu et al., 2021).
Dataset Splits | No | The paper mentions a "validation set" for STAR and distinct splits for Something-Else (base and compositional) and CATER (static and moving camera), but does not provide the split percentages or sample counts for training, validation, and testing across all datasets that would be needed for reproduction.
Hardware Specification | Yes | We use 4 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer and OPT-family LMs, but does not provide specific software package names with version numbers (e.g., PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | We trained our LRR model with the OPT-125M and OPT-1.3B backbones until convergence (~500k iterations) with a batch size of 4. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1 × 10^-5, β1 = 0.9, β2 = 0.95 and λ (weight decay) = 0.1, and gradient clipping with a norm of 1.0. (A sketch of this configuration appears below the table.)
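For concreteness, the sketch below assembles the training configuration quoted in the Experiment Setup row. It is a minimal illustration assuming a PyTorch plus Hugging Face Transformers stack (the paper names neither library nor versions); the model identifier and the training-step helper are assumptions for illustration, not the authors' released code.

import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

# OPT-125M backbone; the paper also reports an OPT-1.3B variant.
# (Checkpoint name is an assumption; the LRR model adds components
# on top of the LM backbone that are not reproduced here.)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# AdamW hyperparameters as quoted above:
# lr = 1e-5, (beta1, beta2) = (0.9, 0.95), weight decay = 0.1.
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def training_step(batch):
    """One optimization step with gradient clipping at norm 1.0."""
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

With a batch size of 4, this step would be repeated for roughly 500k iterations until convergence, per the quoted setup.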