Look, Remember and Reason: Grounded Reasoning in Videos with Language Models
Authors: Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, Roland Memisevic
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across tasks by a large margin. |
| Researcher Affiliation | Industry | Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic Qualcomm AI Research |
| Pseudocode | No | The paper includes architectural diagrams and mathematical equations, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper does not contain an explicit statement that the authors' code is released or a direct link to a code repository for their methodology. |
| Open Datasets | Yes | We evaluate on visual reasoning tasks from: ACRE (Zhang et al., 2021), Something-Else (Materzynska et al., 2020), CATER (Girdhar & Ramanan, 2020) and STAR (Wu et al., 2021). |
| Dataset Splits | No | The paper mentions evaluating on a "validation set" for STAR and on different splits for Something-Else (base and compositional) and CATER (static and moving camera), but it does not explicitly provide the percentages or sample counts for the training, validation, and test splits of each dataset that would be needed for reproduction. |
| Hardware Specification | Yes | We use 4 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and OPT-family LMs, but does not provide specific software package names with version numbers (e.g., PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | We trained our LRR model with the OPT-125M and OPT-1.3B backbone until convergence (~500k iterations) with a batch size of 4. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1 × 10⁻⁵, β1 = 0.9, β2 = 0.95 and λ (weight decay) = 0.1, and gradient clipping with a norm of 1.0. |
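The reported optimizer settings are concrete enough to sketch in code. The following is a minimal PyTorch illustration of the stated hyperparameters (AdamW with lr = 1e-5, betas = (0.9, 0.95), weight decay = 0.1, gradient clipping at norm 1.0); the model and the `training_step` helper are placeholders, not the authors' code.

```python
import torch

# Stand-in module; the paper uses OPT-125M / OPT-1.3B backbones.
model = torch.nn.Linear(8, 8)

# Hyperparameters as reported in the paper's experiment setup.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def training_step(batch_x, batch_y):
    # Illustrative step with a placeholder loss (the paper's actual
    # objective is a language-modeling loss, not MSE).
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
    loss.backward()
    # Gradient clipping with a max norm of 1.0, as reported.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

The paper reports a batch size of 4 and roughly 500k iterations on 4 Nvidia A100 GPUs; those details sit in the training loop rather than the optimizer configuration above.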