Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art. |
| Researcher Affiliation | Collaboration | Kumar Ashutosh (UT Austin and FAIR, Meta); Santhosh Kumar Ramakrishnan (UT Austin); Triantafyllos Afouras (FAIR, Meta); Kristen Grauman (UT Austin and FAIR, Meta) |
| Pseudocode | No | The paper describes the technical approach and steps in prose and with diagrams (e.g., Fig. 2), but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://vision.cs.utexas.edu/projects/task_graph/ |
| Open Datasets | Yes | We use three public datasets of instructional videos (COIN, CrossTask, and HowTo100M), all of which were compiled from in-the-wild data on YouTube and are accompanied by ASR transcriptions of the YouTuber's spoken narrations (n_t). |
| Dataset Splits | No | Table 3 shows the task classification results compared to multiple state-of-the-art methods on the validation split of HowTo100M, the dataset we use for pretraining f_V. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions optimizers (SGD, AdamW) and models (MPNet, TimeSformer) but does not provide specific version numbers for software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For zero-shot keystep segmentation, we use γ = 0.5 and γ = 0.3 for text and video features, respectively, since video features offer stronger supervision. For representation learning, similar to [45], we train the video model for 15 epochs with SGD at a learning rate of 5×10⁻³, followed by 15 epochs with AdamW [46] at a learning rate of 5×10⁻⁵. In both cases, the learning rate is decayed by a factor of 10 at epochs 11 and 14. (See the training-schedule sketch below the table.) |
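
The Experiment Setup row describes a two-phase optimization schedule: 15 epochs of SGD at 5×10⁻³, then 15 epochs of AdamW at 5×10⁻⁵, with the learning rate dropped by 10× at epochs 11 and 14 of each phase. Below is a minimal PyTorch sketch of that schedule under stated assumptions; the backbone, data, and per-epoch loop are hypothetical placeholders, not the authors' released code.

```python
import torch
from torch import nn
from torch.optim import SGD, AdamW
from torch.optim.lr_scheduler import MultiStepLR

# Hypothetical stand-in for the video backbone (the paper uses TimeSformer);
# feature and class dimensions here are illustrative only.
model = nn.Linear(768, 778)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, optimizer):
    """Placeholder epoch: one dummy batch of random features and labels."""
    feats = torch.randn(8, 768)
    labels = torch.randint(0, 778, (8,))
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()

def train_phase(make_optimizer, num_epochs=15, milestones=(11, 14), gamma=0.1):
    """One training phase; the learning rate drops by 10x at the given epochs."""
    optimizer = make_optimizer(model.parameters())
    scheduler = MultiStepLR(optimizer, milestones=list(milestones), gamma=gamma)
    for _ in range(num_epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()

# Phase 1: 15 epochs of SGD at 5e-3; Phase 2: 15 epochs of AdamW at 5e-5.
train_phase(lambda params: SGD(params, lr=5e-3))
train_phase(lambda params: AdamW(params, lr=5e-5))
```

The two phases are expressed as separate optimizer constructions so that each one gets a fresh MultiStepLR schedule, matching the quoted statement that the decay at epochs 11 and 14 applies "in both cases".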