Towards Automatic Learning of Procedures From Web Instructional Videos

Authors: Luowei Zhou, Chenliang Xu, Jason Corso

AAAI 2018

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the LLM's supporting response:

Research Type: Experimental
"We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation. For evaluation, we compare variants of our model with competitive baselines on standard metrics and the proposed methods demonstrate top performance against baselines."

Researcher Affiliation: Academia
"Luowei Zhou, Robotics Institute, University of Michigan, luozhou@umich.edu; Chenliang Xu, Department of CS, University of Rochester, Chenliang.Xu@rochester.edu; Jason J. Corso, Department of EECS, University of Michigan, jjcorso@eecs.umich.edu"

Pseudocode: No
"The paper describes the model architecture and procedures in detail using text and diagrams, but it does not include a structured pseudocode block or algorithm."

Open Source Code: No
"The paper does not provide an unambiguous statement or a direct link to the source code for the methodology described in this paper. It only links to the dataset and a third-party ResNet implementation."

Open Datasets: Yes
"Our new dataset, called YouCook2, contains 2000 videos from 89 recipes with a total length of 176 hours. Dataset website: http://youcook2.eecs.umich.edu"

Dataset Splits: Yes
"We randomly split the dataset to 67%:23%:10% for training, validation and testing according to each recipe." (See the split sketch below.)

Hardware Specification: No
"The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments."

Software Dependencies: No
"The paper states 'Our implementation is in Torch' but does not provide specific version numbers for Torch or any other software libraries or dependencies used in the experiments."

Experiment Setup: Yes
"The sizes of the temporal conv. kernels (also the anchor lengths) run from 3 to 123 with an interval of 8, which covers 95% of the segment durations in the training set. This gives 16 explicit anchors centered at each frame, i.e., the stride of the temporal conv. is 1. We randomly select U = 100 samples from all the positive and negative samples respectively, and feed in negative samples if positive ones number fewer than U. Our implementation is in Torch. All the LSTMs have one layer and 512 hidden units. For hyperparameters, the learning rate is 4×10⁻⁵. We use the Adam optimizer (Kingma and Ba 2014) for updating weights, with α = 0.8 and β = 0.999. Note that we disable CNN fine-tuning, which heavily slows down the training process." (See the setup sketch below.)
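
The split procedure is described only in prose, and the splitting code is not released. Below is a minimal Python sketch of one way a 67%:23%:10% split performed within each recipe could be reproduced; the (video_id, recipe_id) input format, the rounding scheme, and the fixed seed are assumptions rather than details from the paper.

```python
import random
from collections import defaultdict

def split_by_recipe(videos, seed=0):
    """Shuffle and split (video_id, recipe_id) pairs into train/val/test
    at 67%:23%:10% within each recipe, so every recipe is represented
    in all three partitions."""
    rng = random.Random(seed)
    by_recipe = defaultdict(list)
    for vid, recipe in videos:
        by_recipe[recipe].append(vid)

    train, val, test = [], [], []
    for vids in by_recipe.values():
        rng.shuffle(vids)
        n_train = round(0.67 * len(vids))
        n_val = round(0.23 * len(vids))
        train += vids[:n_train]
        val += vids[n_train:n_train + n_val]
        test += vids[n_train + n_val:]
    return train, val, test

# Example with YouCook2-like proportions: 2000 videos over 89 recipes.
videos = [(f"vid_{i:04d}", i % 89) for i in range(2000)]
train, val, test = split_by_recipe(videos)
print(len(train), len(val), len(test))  # ~67% / ~23% / ~10% of 2000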
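
The Experiment Setup quote maps directly onto code, but the authors' (Lua) Torch implementation is not public. The PyTorch sketch below illustrates the reported settings under stated assumptions: 16 temporal conv. kernel sizes from 3 to 123 at stride 1, sampling of U = 100 positives and negatives with negative top-up, one-layer 512-unit LSTMs, and Adam with learning rate 4×10⁻⁵. Module names, the feature dimension, and the reading of α = 0.8 / β = 0.999 as Adam's decay rates are assumptions.

```python
import random
import torch
import torch.nn as nn

# 16 anchor lengths: temporal conv. kernel sizes 3, 11, ..., 123 (step 8).
ANCHOR_SIZES = list(range(3, 124, 8))
assert len(ANCHOR_SIZES) == 16

class TemporalAnchors(nn.Module):
    """One 1-D conv per anchor length at stride 1 with 'same' padding,
    so every frame position scores 16 candidate segments centered on it.
    (Hypothetical module name; feat_dim=512 is an assumption.)"""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(feat_dim, 1, kernel_size=k, stride=1, padding=k // 2)
            for k in ANCHOR_SIZES  # all sizes are odd, so length is preserved
        )

    def forward(self, x):  # x: (batch, feat_dim, num_frames)
        # -> (batch, 16, num_frames): one score per anchor length per frame
        return torch.cat([conv(x) for conv in self.convs], dim=1)

def sample_anchors(pos, neg, U=100, rng=random):
    """Draw U positive and U negative anchor samples; if fewer than U
    positives exist, top up the batch with extra negatives."""
    pos_s = rng.sample(pos, min(U, len(pos)))
    neg_s = rng.sample(neg, min(2 * U - len(pos_s), len(neg)))
    return pos_s, neg_s

model = TemporalAnchors()
# One-layer, 512-unit LSTMs elsewhere in the model (input size assumed):
lstm = nn.LSTM(input_size=512, hidden_size=512, num_layers=1)
# Adam with learning rate 4e-5; reading the paper's "α = 0.8, β = 0.999"
# as the decay rates (β1, β2) is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=4e-5, betas=(0.8, 0.999))
```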