Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art. |
| Researcher Affiliation | Collaboration | Kumar Ashutosh (UT Austin and FAIR, Meta); Santhosh Kumar Ramakrishnan (UT Austin); Triantafyllos Afouras (FAIR, Meta); Kristen Grauman (UT Austin and FAIR, Meta) |
| Pseudocode | No | The paper describes the technical approach and steps in prose and with diagrams (e.g., Fig. 2), but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://vision.cs.utexas.edu/projects/task_graph/ |
| Open Datasets | Yes | We use three public datasets of instructional videos (COIN, CrossTask, and HowTo100M), all of which were compiled from in-the-wild data on YouTube and are accompanied by ASR transcriptions of the YouTuber's spoken narrations (n_t). |
| Dataset Splits | No | Table 3 shows the task classification results compared to multiple state-of-the-art methods on the validation split of HowTo100M, the dataset we use for pretraining f_V. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions optimizers (SGD, AdamW) and models (MPNet, TimeSformer) but does not provide specific version numbers for software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For zero-shot keystep segmentation, we use γ = 0.5 and γ = 0.3 for text and video features, respectively, since video features offer stronger supervision. For representation learning, similar to [45], we train the video model for 15 epochs with SGD at a learning rate of 5×10⁻³, followed by 15 epochs with AdamW [46] at a learning rate of 5×10⁻⁵. In both cases, the learning rate is decayed by a factor of 10 at epochs 11 and 14. (See the training-schedule sketch below the table.) |
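
The Experiment Setup row describes a two-phase optimization schedule: 15 epochs of SGD at 5×10⁻³, then 15 epochs of AdamW at 5×10⁻⁵, with the learning rate dropped by 10× at epochs 11 and 14 of each phase. Below is a minimal PyTorch sketch of that schedule under stated assumptions; the backbone, data, and per-epoch loop are hypothetical placeholders, not the authors' released code.

```python
import torch
from torch import nn
from torch.optim import SGD, AdamW
from torch.optim.lr_scheduler import MultiStepLR

# Hypothetical stand-in for the video backbone (the paper uses TimeSformer);
# feature and class dimensions here are illustrative only.
model = nn.Linear(768, 778)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, optimizer):
    """Placeholder epoch: one dummy batch of random features and labels."""
    feats = torch.randn(8, 768)
    labels = torch.randint(0, 778, (8,))
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()

def train_phase(make_optimizer, num_epochs=15, milestones=(11, 14), gamma=0.1):
    """One training phase; the learning rate drops by 10x at the given epochs."""
    optimizer = make_optimizer(model.parameters())
    scheduler = MultiStepLR(optimizer, milestones=list(milestones), gamma=gamma)
    for _ in range(num_epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()

# Phase 1: 15 epochs of SGD at 5e-3; Phase 2: 15 epochs of AdamW at 5e-5.
train_phase(lambda params: SGD(params, lr=5e-3))
train_phase(lambda params: AdamW(params, lr=5e-5))
```

The two phases are expressed as separate optimizer constructions so that each one gets a fresh MultiStepLR schedule, matching the quoted statement that the decay at epochs 11 and 14 applies "in both cases".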