Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variables (each entry gives the variable, its classified result, and the supporting LLM response):

Research Type: Experimental
  "Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID." (Section 4, Experiments)

Researcher Affiliation: Academia
  "1 Harbin Institute of Technology, Weihai; 2 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS; EMAIL, EMAIL, EMAIL"

Pseudocode: No
  The paper describes its methods using text and figures (Figure 2: overview of the Masked Temporal Interpolation Diffusion; Figure 3: latent-space temporal interpolation module, residual temporal block, and cross-attention module) but does not include any explicitly labeled "Pseudocode" or "Algorithm" block.

Open Source Code: Yes
  "The code is available at https://github.com/WiserZhou/MTID."

Open Datasets: Yes
  "We evaluate our MTID method on three instructional video datasets: CrossTask (Zhukov et al., 2019), COIN (Tang et al., 2019), and NIV (Alayrac et al., 2016)."

Dataset Splits: Yes
  "We randomly split each dataset into training (70% of videos per task) and testing (30%), following previous works (Sun et al., 2022; Wang et al., 2023b; Niu et al., 2024)."

Hardware Specification: Yes
  "Training is performed using ADAM (Kingma, 2014) on 8 NVIDIA RTX 3090 GPUs."

Software Dependencies: No
  The paper names ADAM (Kingma, 2014) as the optimizer but does not specify version numbers for other software dependencies, such as the programming language or deep-learning libraries (e.g., Python, PyTorch/TensorFlow).

Experiment Setup: Yes
  "For the CrossTask dataset, we set the diffusion steps to 250 and train for 20,000 steps. The learning rate is linearly increased to 5 × 10⁻⁴ over the first 3,333 steps, then halved at steps 8,333, 13,333, and 18,333. For the NIV dataset, with 50 diffusion steps, training lasts for 5,000 steps. The learning rate ramps up to 3 × 10⁻⁴ over the first 1,000 steps and is reduced by 50% at steps 2,666 and 4,332. In the larger COIN dataset, we use 300 diffusion steps and train for 30,000 steps. The learning rate increases to 1 × 10⁻⁵ in the first 5,000 steps and is halved at steps 12,500, 20,000, and 27,500, stabilizing at 2.5 × 10⁻⁶ for the remaining steps. Training is performed using ADAM (Kingma, 2014) on 8 NVIDIA RTX 3090 GPUs."
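The per-task 70/30 random split the authors describe could be sketched as below; `split_by_task`, its arguments, and the seed handling are illustrative assumptions, not the authors' actual split code:

```python
import random

def split_by_task(task_to_videos, train_frac=0.7, seed=0):
    """Randomly split each task's video list into train/test portions."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for task, videos in task_to_videos.items():
        vids = list(videos)
        rng.shuffle(vids)                     # random assignment per task
        k = round(train_frac * len(vids))     # 70% of videos per task
        train[task], test[task] = vids[:k], vids[k:]
    return train, test
```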
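The CrossTask schedule quoted above (linear warmup to 5 × 10⁻⁴ over the first 3,333 steps, then halving at steps 8,333, 13,333, and 18,333) can be sketched as a small helper; the function name and exact formulation are assumptions for illustration, not code from the MTID repository:

```python
def crosstask_lr(step, base_lr=5e-4, warmup_steps=3333,
                 milestones=(8333, 13333, 18333)):
    """Piecewise learning rate: linear warmup, then halve at each milestone."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps   # linear ramp from 0 to base_lr
    halvings = sum(step >= m for m in milestones)
    return base_lr * (0.5 ** halvings)         # halved once per passed milestone
```

In a PyTorch training loop, a schedule of this shape would typically be attached to the optimizer via a scheduler such as `torch.optim.lr_scheduler.LambdaLR`.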