Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PlanLLM: Video Procedure Planning with Refinable Large Language Models

Authors: Dejie Yang, Zijing Zhao, Yang Liu

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs. Code https://github.com/idejie/PlanLLM
Researcher Affiliation	Academia	1 Wangxuan Institute of Computer Technology, Peking University 2 State Key Laboratory of General Artificial Intelligence, Peking University EMAIL, EMAIL
Pseudocode	No	No explicit pseudocode or algorithm blocks were found in the main body of the paper.
Open Source Code	Yes	Code https://github.com/idejie/PlanLLM
Open Datasets	Yes	We employ three commonly used video datasets: CrossTask (Zhukov et al. 2019), NIV (Alayrac et al. 2016), and COIN (Tang et al. 2019).
Dataset Splits	No	The paper mentions using three commonly used video datasets (CrossTask, NIV, and COIN), but the main text does not provide training/validation/test split percentages or sample counts, nor does it reference how these datasets were partitioned for the experiments.
Hardware Specification	Yes	training the model with a batch size of 32 on NVIDIA A800 GPUs.
Software Dependencies	No	The paper mentions using S3D network, CLIP, BLIP2, Vicuna-7B, and LoRA, but does not provide specific version numbers for any of these software components or libraries.
Experiment Setup	Yes	During the frozen LLM training stage, we set the learning rate to 1 × 10⁻⁴ for the Q-Former and 1 × 10⁻³ for other modules, training the model with a batch size of 32 on NVIDIA A800 GPUs.