Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID. |
| Researcher Affiliation | Academia | 1Harbin Institute of Technology, Weihai 2Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and processes using text and figures (Figure 2: Overview of our Masked Temporal Interpolation Diffusion; Figure 3: Latent space temporal interpolation module & Residual temporal block & cross-attention module) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID. |
| Open Datasets | Yes | We evaluate our MTID method on three instructional video datasets: CrossTask (Zhukov et al., 2019), COIN (Tang et al., 2019), and NIV (Alayrac et al., 2016). |
| Dataset Splits | Yes | We randomly split each dataset into training (70% of videos per task) and testing (30%), following previous works (Sun et al., 2022; Wang et al., 2023b; Niu et al., 2024). |
| Hardware Specification | Yes | Training is performed using ADAM (Kingma, 2014) on 8 NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using ADAM (Kingma, 2014) as the optimizer but does not specify version numbers for other software dependencies like programming languages or libraries (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | For the CrossTask dataset, we set the diffusion steps to 250 and train for 20,000 steps. The learning rate is linearly increased to 5 × 10−4 over the first 3,333 steps, then halved at steps 8,333, 13,333, and 18,333. For the NIV dataset, with 50 diffusion steps, training lasts for 5,000 steps. The learning rate ramps up to 3 × 10−4 over the first 1,000 steps and is reduced by 50% at steps 2,666 and 4,332. In the larger COIN dataset, we use 300 diffusion steps and train for 30,000 steps. The learning rate increases to 1 × 10−5 in the first 5,000 steps and is halved at steps 12,500, 20,000, and 27,500, stabilizing at 2.5 × 10−6 for the remaining steps. Training is performed using ADAM (Kingma, 2014) on 8 NVIDIA RTX 3090 GPUs. |
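The learning-rate schedule quoted in the Experiment Setup row (CrossTask: linear warm-up followed by halvings at fixed milestones) can be sketched as a simple step function. This is a reconstruction from the paper's text only, not the authors' released code; the function name and the per-step warm-up interpolation are assumptions.

```python
def lr_schedule(step, base_lr=5e-4, warmup=3333, halve_at=(8333, 13333, 18333)):
    """Sketch of the CrossTask schedule described in the paper's text
    (hypothetical helper, not from the MTID repository):
    linear warm-up to base_lr over `warmup` steps, then the rate is
    halved at each milestone in `halve_at`."""
    if step < warmup:
        # Linear ramp from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup
    lr = base_lr
    for milestone in halve_at:
        if step >= milestone:
            lr *= 0.5
    return lr
```

The NIV and COIN schedules follow the same pattern with their own base rates and milestones, so they could reuse this function with different arguments.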