Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
Researcher Affiliation | Collaboration | Kumar Ashutosh (UT Austin and FAIR, Meta), Santhosh Kumar Ramakrishnan (UT Austin), Triantafyllos Afouras (FAIR, Meta), Kristen Grauman (UT Austin and FAIR, Meta)
Pseudocode | No | The paper describes the technical approach and steps in prose and with diagrams (e.g., Fig. 2), but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: https://vision.cs.utexas.edu/projects/task_graph/
Open Datasets | Yes | We use three public datasets of instructional videos: COIN, CrossTask, and HowTo100M, all of which were compiled from in-the-wild data on YouTube and are accompanied by ASR transcriptions of the YouTuber's spoken narrations (n_t).
Dataset Splits | No | Table 3 shows the task classification results compared to multiple state-of-the-art methods on the validation split of HowTo100M, the dataset we use for pretraining f_V.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions optimizers (SGD, AdamW) and models (MPNet, TimeSformer) but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For zero-shot keystep segmentation, we use γ = 0.5 and γ = 0.3 for text and video features, respectively, since video features offer stronger supervision. For representation learning, similar to [45], we train the video model for 15 epochs with SGD with learning rate 5×10⁻³, followed by 15 epochs with AdamW [46] with learning rate 5×10⁻⁵. In both cases, the learning rate is decayed by a factor of 10 at epochs 11 and 14.
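
For readers who want to reproduce the reported training schedule, below is a minimal PyTorch sketch of the two-stage optimization described above (SGD followed by AdamW, with the learning rate dropping by 10× at epochs 11 and 14 of each stage). The model, loss, and data are placeholders, not the authors' released code; only the optimizer choices, learning rates, epoch counts, and decay milestones come from the quoted setup.

```python
# Sketch of the two-stage optimization schedule quoted from the paper.
# All names below (model, loss, batch shapes) are assumptions for illustration.
import torch
import torch.nn as nn

model = nn.Linear(768, 512)   # stand-in for the TimeSformer video encoder
criterion = nn.MSELoss()      # stand-in for the paper's training objective


def run_stage(optimizer, num_epochs=15, milestones=(11, 14)):
    # Learning rate is decayed by a factor of 10 at epochs 11 and 14,
    # as stated in the experiment setup.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=list(milestones), gamma=0.1)
    for epoch in range(num_epochs):
        # Placeholder batch; a real run would iterate over the video dataloader.
        x, y = torch.randn(8, 768), torch.randn(8, 512)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()


# Stage 1: 15 epochs of SGD with learning rate 5e-3.
run_stage(torch.optim.SGD(model.parameters(), lr=5e-3))
# Stage 2: 15 epochs of AdamW with learning rate 5e-5.
run_stage(torch.optim.AdamW(model.parameters(), lr=5e-5))
```

A real reproduction would replace the stand-in encoder and random tensors with the video backbone and HowTo100M dataloader used in the paper; the schedule logic itself is unchanged.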