Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Authors: Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Leonid Sigal, Roland Memisevic
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LIVEMAMBA, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating on live, situated coaching. We report the zero-shot evaluation results in Tab. 3 for the main set of the Qualcomm Interactive Cooking benchmark. |
| Researcher Affiliation | Collaboration | 1 Qualcomm AI Research 2 University of British Columbia |
| Pseudocode | No | The paper describes its methods through architectural diagrams (e.g., Figure 2) and detailed textual explanations within sections like "LIVEMAMBA for Step-by-Step Instructions" and its subsections, as well as in the appendix, but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Click here to access the data and click here to access the code. |
| Open Datasets | Yes | To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon Captain Cook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. |
| Dataset Splits | Yes | We provide statistics of the Qualcomm Interactive Cooking benchmark and dataset in Tab. 2, including the total length in hours, number of videos, and numbers of instructions and feedback messages. The training, validation, and testing splits within each set follow the original video recording split from Captain Cook4D, and the test splits correspond to the Qualcomm Interactive Cooking benchmark. |
| Hardware Specification | Yes | In terms of throughput, on a consumer Nvidia H100 GPU, our LIVEMAMBA model has a real-time factor of 4 on average: it can process input data four times as fast at 8.1 frames per second as it becomes available at 2 frames per second. The LIVEMAMBA model is trained using 8 Nvidia H100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW [42] optimizer and specific LLM models like Qwen3-32B [16] and Qwen-2.5-32B-Instruct [15]. However, it does not provide specific version numbers for programming languages (e.g., Python) or common software libraries/frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We use input video resolution of 448 × 448 at 2 fps. The InternViT-300M-448px-V2_5 vision head produces N = 1025 tokens (including the CLS token) per input frame. We use the mechanism outlined in VisionZip [59] to reduce the number of tokens to 256. Then, our Q-Former reduces this further to K = 32 tokens. ... During the pre-training phase, we train only the Q-Former and the LIVEMAMBA model is trained using a learning rate of 1 × 10⁻⁵ for 200k iterations. We again use a learning rate of 1 × 10⁻⁵, for 120k iterations. During the fine-tuning phase, we train on single recipe steps and clip the maximum length to 3 minutes. During inference, we re-initialize the LIVEMAMBA model after every recipe step. |
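
The throughput and token counts quoted in the Hardware Specification and Experiment Setup rows can be checked with a little arithmetic. The sketch below is illustrative only: it uses the figures reported in the table, while the constant and function names are hypothetical and do not come from the paper or its code. The per-clip token count is a derived estimate that assumes every frame of a 3-minute clip contributes exactly K = 32 visual tokens.

```python
# Illustrative arithmetic only; all names here are hypothetical, not from the paper's code.

FPS_INPUT = 2              # input frame rate reported in the table (frames/s)
FPS_PROCESSED = 8.1        # reported processing throughput on one H100 (frames/s)
TOKENS_VIT = 1025          # vision tokens per frame, including the CLS token (N)
TOKENS_PRUNED = 256        # tokens per frame after the VisionZip-style reduction
TOKENS_QFORMER = 32        # tokens per frame after the Q-Former (K)
MAX_CLIP_SECONDS = 3 * 60  # fine-tuning clips are capped at 3 minutes


def real_time_factor(processed_fps: float, input_fps: float) -> float:
    """Ratio of processing speed to the rate at which frames arrive."""
    return processed_fps / input_fps


def tokens_per_clip(seconds: float, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens for a clip, assuming a fixed token count per frame."""
    return int(seconds * fps) * tokens_per_frame


rtf = real_time_factor(FPS_PROCESSED, FPS_INPUT)
max_clip_tokens = tokens_per_clip(MAX_CLIP_SECONDS, FPS_INPUT, TOKENS_QFORMER)
compression = TOKENS_VIT / TOKENS_QFORMER

print(f"real-time factor: {rtf:.2f}")                       # 4.05, reported as about 4
print(f"visual tokens in a 3-minute clip: {max_clip_tokens}")  # 11520
print(f"per-frame token reduction: {compression:.1f}x")     # 32.0x
```

Under these assumptions, the reported real-time factor of 4 follows directly from 8.1 / 2, and the two-stage reduction from 1025 to 32 tokens per frame keeps a maximum-length fine-tuning clip to roughly 11.5k visual tokens.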