SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Authors: Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.
Researcher Affiliation | Academia | Columbia University; The Hong Kong University of Science and Technology
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code: https://github.com/WenliangGuo/SCHEMA
Open Datasets | Yes | We evaluate our SCHEMA method on three benchmark instructional video datasets: CrossTask (Zhukov et al., 2019), COIN (Tang et al., 2019), and NIV (Alayrac et al., 2016).
Dataset Splits | No | Following previous works (Chang et al., 2020; Bi et al., 2021; Sun et al., 2022), we randomly select 70% of the videos in each task as the training set and take the others as the test set. A separate validation split percentage or sample count is not explicitly provided. (A sketch of such a per-task split appears after the table.)
Hardware Specification | Yes | The training process takes 1 hour (500 epochs) on CrossTask and 5.5 hours (400 epochs) on COIN using a single V100 GPU.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'CLIP', 'S3D network', and 'GPT-3.5', but it does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We train our model with the Adam optimizer and an initial learning rate of 5e-3, decayed by a factor of 0.65 every 40 epochs. The batch size is set to 256. Each self-attention and cross-attention module consists of 32 heads, and the hidden layer size is set to 128. The step classifier is a two-layer MLP with a hidden size of 128. The dropout ratio is 0.2. (A configuration sketch appears after the table.)
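
The Dataset Splits row describes a per-task 70/30 random train/test split with no validation set. Below is a minimal sketch of how such a split could be implemented; the `split_per_task` name, the `task_to_videos` mapping, and the fixed seed are illustrative assumptions, not taken from the paper or the authors' code.

```python
import random

def split_per_task(task_to_videos, train_ratio=0.7, seed=0):
    """Randomly assign ~70% of each task's videos to train and the rest to test."""
    rng = random.Random(seed)  # fixed seed for a reproducible split (assumption)
    train, test = [], []
    for task, videos in task_to_videos.items():
        videos = list(videos)
        rng.shuffle(videos)
        cut = int(train_ratio * len(videos))  # truncate toward 70%
        train += [(task, v) for v in videos[:cut]]
        test += [(task, v) for v in videos[cut:]]
    return train, test

# Toy usage with hypothetical video IDs:
train, test = split_per_task({"make_pancakes": ["v1", "v2", "v3", "v4", "v5"]})
print(len(train), len(test))  # 3 2
```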
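
Similarly, the Experiment Setup row maps directly onto standard PyTorch components. The following is a minimal sketch wiring the quoted hyperparameters together; the module stand-ins, `NUM_STEPS`, and the exact placement of ReLU/dropout inside the two-layer classifier are assumptions for illustration, not the authors' implementation (see their repository for that).

```python
import torch
import torch.nn as nn

# Hyperparameters quoted from the paper's experiment setup.
HIDDEN_DIM = 128   # hidden layer size of attention modules and MLP classifier
NUM_HEADS = 32     # heads per self-/cross-attention module
DROPOUT = 0.2
BATCH_SIZE = 256
NUM_STEPS = 100    # placeholder step-vocabulary size (dataset-dependent)

# Minimal stand-ins for SCHEMA's attention modules and step classifier.
attention = nn.MultiheadAttention(HIDDEN_DIM, NUM_HEADS, dropout=DROPOUT,
                                  batch_first=True)
step_classifier = nn.Sequential(          # two-layer MLP step classifier
    nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
    nn.ReLU(),
    nn.Dropout(DROPOUT),
    nn.Linear(HIDDEN_DIM, NUM_STEPS),
)

params = list(attention.parameters()) + list(step_classifier.parameters())
optimizer = torch.optim.Adam(params, lr=5e-3)
# Decay the learning rate by a factor of 0.65 every 40 epochs
# (scheduler.step() would be called once per epoch in the training loop).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.65)
```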