Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Authors: Han Lin, Tushar Nagarajan, Nicolas Ballas, Mahmoud Assran, Mojtaba Komeili, Mohit Bansal, Koustuv Sinha

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical studies over a total of five procedural learning tasks across four datasets (NIV, CrossTask, COIN and Ego4D-v2) show that our model advances the strong baselines in long-horizon action anticipation (+2.6% in Verb ED@20, +3.1% in Noun ED@20), and significantly improves the SoTA in step forecasting (+5.0%), task classification (+3.8%), and procedure planning tasks (up to +2.28% in success rate, +3.39% in mAcc, and +0.90% in mIoU).
Researcher Affiliation Collaboration Han Lin 1, Tushar Nagarajan2, Nicolas Ballas2, Mido Assran2, Mojtaba Komeili2, Mohit Bansal1 & Koustuv Sinha2 1UNC Chapel Hill, 2FAIR, Meta EMAIL, EMAIL
Pseudocode Yes Furthermore, we provide the PyTorch (Ansel et al., 2024) implementation of VEDIT in Algorithm 1. Algorithm 1 Simplified PyTorch Implementation for Each VEDiT Block
Open Source Code Yes Project page: https://github.com/HL-hanlin/vedit.
Open Datasets Yes We evaluate our method on five downstream tasks across four datasets. COIN (Tang et al., 2019) contains 476 hours of YouTube videos covering 180 tasks and 778 unique steps of daily activities. For Ego4D-v2 (Grauman et al., 2022), we focus on the long-term action anticipation benchmark... In addition, we utilize NIV (Alayrac et al., 2016), CrossTask (Zhukov et al., 2019), and COIN datasets to evaluate the procedure planning task...
Dataset Splits No The paper references existing works for training settings and discusses evaluation on a 'validation set' for Ego4D-v2, but it does not explicitly state the training/validation/test splits (e.g., percentages or exact sample counts) for any of the datasets within its own text. It defers to prior works like Grauman et al. (2022) and Niu et al. (2024) for training settings.
Hardware Specification Yes The pretraining is conducted on 128 H100 GPUs with a total batch size of 1024, and takes 2 days and 4.5 days for the 165M and 1.77B VEDIT models respectively (see Table 9 for model architecture details).
Software Dependencies No The paper mentions "PyTorch (Ansel et al., 2024)" in the appendix but does not provide a specific version number for PyTorch or any other key software libraries used in the implementation.
Experiment Setup Yes Our default VEDIT architecture contains 12 transformer blocks, with a hidden size of 2048 and attention head dimension of 64. During training, we apply classifier-free guidance with a scale of 7 and denoise the diffusion model for 24 steps using the Flow Matching Euler Discrete Scheduler (Esser et al., 2024). For COIN step forecasting and task classification tasks, we use a learning rate that linearly increases from 5 x 10^-6 to 5 x 10^-5 during the first 3 epochs, and then decays to 5 x 10^-7 following a cosine schedule, over a total of 30 epochs. For long-horizon anticipation and the procedure planning tasks...we train the model for 100 and 500 epochs respectively.
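The warmup-plus-cosine learning-rate schedule quoted in the experiment setup can be sketched as below. This is a minimal illustration, not the paper's code: the function name, per-epoch granularity, and exact cosine form are assumptions; only the numeric endpoints (5e-6 → 5e-5 warmup over 3 epochs, cosine decay to 5e-7 by epoch 30) come from the quoted text.

```python
import math

def vedit_lr(epoch, total_epochs=30, warmup_epochs=3,
             lr_start=5e-6, lr_peak=5e-5, lr_end=5e-7):
    """Illustrative schedule: linear warmup from lr_start to lr_peak
    over the first warmup_epochs, then cosine decay to lr_end by
    total_epochs. Name and signature are hypothetical."""
    if epoch < warmup_epochs:
        # linear warmup
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    # cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))
```

In a PyTorch training loop this shape is typically wired up via `torch.optim.lr_scheduler.LambdaLR` with a per-step multiplier, but the paper does not state which scheduler utility it used.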