Look, Remember and Reason: Grounded Reasoning in Videos with Language Models

Authors: Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, Roland Memisevic

Venue: ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across tasks by a large margin.
Researcher Affiliation | Industry | Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic (Qualcomm AI Research)
Pseudocode | No | The paper includes architectural diagrams and mathematical equations, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper neither states explicitly that the authors' code is released nor links to a code repository for their method.
Open Datasets | Yes | We evaluate on visual reasoning tasks from: ACRE (Zhang et al., 2021), Something-Else (Materzynska et al., 2020), CATER (Girdhar & Ramanan, 2020) and STAR (Wu et al., 2021).
Dataset Splits | No | The paper mentions a "validation set" for STAR and distinct splits for Something-Else (base and compositional) and CATER (static and moving camera), but does not provide the split percentages or sample counts for training, validation, and testing across all datasets that would be needed for reproduction.
Hardware Specification | Yes | We use 4 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer and OPT-family LMs, but does not provide specific software package names with version numbers (e.g., PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | We trained our LRR model with the OPT-125M and OPT-1.3B backbones until convergence (~500k iterations) with a batch size of 4. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1 × 10^-5, β1 = 0.9, β2 = 0.95 and λ (weight decay) = 0.1, and gradient clipping with a norm of 1.0. (A sketch of this configuration appears below the table.)
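For concreteness, the sketch below assembles the training configuration quoted in the Experiment Setup row. It is a minimal illustration assuming a PyTorch plus Hugging Face Transformers stack (the paper names neither library nor versions); the model identifier and the training-step helper are assumptions for illustration, not the authors' released code.

import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

# OPT-125M backbone; the paper also reports an OPT-1.3B variant.
# (Checkpoint name is an assumption; the LRR model adds components
# on top of the LM backbone that are not reproduced here.)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# AdamW hyperparameters as quoted above:
# lr = 1e-5, (beta1, beta2) = (0.9, 0.95), weight decay = 0.1.
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def training_step(batch):
    """One optimization step with gradient clipping at norm 1.0."""
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

With a batch size of 4, this step would be repeated for roughly 500k iterations until convergence, per the quoted setup.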