Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Training-free Online Video Step Grounding

Authors: Luca Zanella, Massimiliano Mancini, Yiming Wang, Alessio Tonioni, Elisa Ricci

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on three datasets show superior performance of BAGLM over state-of-the-art training-based offline methods. ... In this section, we describe our experimental protocol and present the comparison w.r.t. the state of the art (Sec. 5.1). Finally, we perform a detailed study on BAGLM (Sec. 5.2). We use the same datasets and metrics described in Sec. 3 in our experiments.
Researcher Affiliation	Collaboration	1 University of Trento; 2 Fondazione Bruno Kessler; 3 Google
Pseudocode	No	The paper describes the Bayesian filtering components and their mathematical formulations (Eq. 2, 3, 4, 7, 8, 9, 10) and provides a diagram in Figure 3 titled 'Overview of BAGLM', but it does not contain a distinct block labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	The code is available at https://github.com/lucazanella/baglm.
Open Datasets	Yes	Experiments on three datasets show superior performance of BAGLM over state-of-the-art training-based offline methods. ... We evaluate methods on three public datasets: CrossTask [36], HT-Step [1], and Ego4D Goal-Step [27]. ... Table 12: List of URLs and licenses for all assets used. CROSSTASK: https://github.com/DmZhukov/CrossTask (BSD-3-Clause); HT-STEP: https://github.com/facebookresearch/htstep (CC-BY-NC 4.0); EGO4D GOAL-STEP: https://github.com/facebookresearch/ego4d-goalstep (MIT); COIN: https://github.com/coin-dataset/annotations (CC BY-NC 4.0)
Dataset Splits	Yes	The seen validation and test splits follow [18], each containing 600 videos in total, with 5 videos per activity across 120 activities. We evaluate with the validation set of seen classes, as the evaluation server hosting the test sets for the seen and unseen classes is unavailable. ... We follow the same evaluation set as indicated in [14], using videos from primary tasks. ... We evaluate on its validation split.
Hardware Specification	Yes	We ran all experiments on a single NVIDIA H100 64GB GPU, except for LLaMA3-70B-Instruct [7], which required 4 H100 GPUs.
Software Dependencies	No	The paper names the specific LMM and LLM models used, such as InternVL2.5-8B and LLaMA3-70B-Instruct, whose names imply model versions. It also mentions 'Using the torchcodec library' and 'We use the model available at https://spacy.io/models/en#en_core_web_sm', but it does not give explicit version numbers for general software dependencies such as Python, PyTorch, or the `torchcodec` library, nor for the `spaCy` model.
Experiment Setup	Yes	Our method is implemented considering InternVL2.5-8B [6] as our LMM, based on the results of Sec. 3. We employ LLaMA3-70B-Instruct [7] as our LLM of choice to derive our transition model. To test our model, we split videos into sequences of non-overlapping 2-second segments, providing them as input to the LMM one after the other. ... In Fig. 4, we show how BAGLM's performance varies w.r.t. the segment duration (from 1 to 4 seconds), considering different LMMs: InternVL2.5 8B [6], InternVL3 8B [35], LLaVA-OneVision 7B [13], and Qwen2.5 7B [2]. ... Appendix A details the prompts used (πVSG, πprog, πprereq) and how scores are computed ('applying a softmax to the logits of the first generated token').
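
As a rough illustration of two setup details quoted above (non-overlapping fixed-length segments, and a softmax over the logits of the first generated token), a minimal Python sketch might look like the following. It is not taken from the paper or from the BAGLM repository: the function names `segment_bounds` and `softmax` are hypothetical, and keeping a shorter final segment when the video length is not a multiple of the segment duration is an assumption, since the quoted text does not say how the remainder is handled.

```python
import math


def segment_bounds(duration_s, segment_s=2.0):
    """Return (start, end) times, in seconds, of non-overlapping segments.

    Assumption (not stated in the quoted text): a trailing segment shorter
    than `segment_s` is kept rather than dropped.
    """
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    n_segments = math.ceil(duration_s / segment_s)
    bounds = []
    for i in range(n_segments):
        start = i * segment_s
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
    return bounds


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # A 5-second video split into 2-second segments.
    print(segment_bounds(5.0))
    # Hypothetical first-token logits for two candidate answers.
    print(softmax([2.0, 0.5]))
```

In this sketch each segment would be passed to the LMM in temporal order, and the softmax output would serve as a per-segment score; how those scores enter the Bayesian filter is described by the paper's equations and is not reproduced here.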