GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

Authors: Jiafeng Liang, Shixin Jiang, Zekun Wang, Haojie Pan, Zerui Chen, Zheng Chu, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate plenty of foundation models with GUIDE and perform in-depth analysis.
Researcher Affiliation | Collaboration | (1) Harbin Institute of Technology, Harbin, China; (2) Peng Cheng Laboratory, Shenzhen, China; (3) Kuaishou Technology, Beijing, China
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a link to a code repository for the methodology described.
Open Datasets | No | The paper introduces a new dataset called GUIDE but does not provide a specific link, DOI, repository name, or formal citation for its public availability. It mentions sourcing videos from Kuaishou but does not indicate where the compiled and annotated GUIDE dataset can be accessed.
Dataset Splits | No | The paper describes experimental settings and evaluations but does not provide specific percentages or counts for the training, validation, and test splits needed for reproduction.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions several foundation models and APIs (e.g., VideoChat, GPT-3.5-turbo, Whisper) but does not provide version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks required for replication.
Experiment Setup | Yes | In video segment captioning (VSC), we divide the video into multiple segments based on ground-truth step timestamps. We uniformly sample 8 frames for each segment and feed them to models. In entire video captioning (EVC), we uniformly sample 32 frames for each video and feed them to models. In guideline summarization, we modify the input format of the video foundation model to enable simultaneous processing of multiple videos. We uniformly sample 32 frames from each video as input. More details are in the Appendix A.3.
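The quoted setup only specifies the frame counts (8 frames per ground-truth segment for VSC; 32 frames per video for EVC and guideline summarization). Below is a minimal sketch of such uniform sampling, assuming OpenCV for video decoding; the function names, timestamp format, and fps handling are illustrative assumptions, not the authors' released code. The sampled frames would then go through each foundation model's own preprocessor, which is model-specific and not shown here.

```python
# Hedged sketch of uniform frame sampling as described in the experiment setup.
# Assumptions: step timestamps are (start, end) pairs in seconds, and OpenCV
# is used for decoding; the paper does not specify these details.
import cv2
import numpy as np


def uniform_frame_indices(start_frame: int, end_frame: int, num_frames: int) -> np.ndarray:
    """Return `num_frames` evenly spaced frame indices in [start_frame, end_frame)."""
    return np.linspace(start_frame, max(start_frame, end_frame - 1), num_frames).round().astype(int)


def sample_segment_frames(video_path, step_timestamps, frames_per_segment=8):
    """VSC: split the video by ground-truth step timestamps (seconds) and
    uniformly sample `frames_per_segment` frames from each segment."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    segments = []
    for start_s, end_s in step_timestamps:
        indices = uniform_frame_indices(int(start_s * fps), int(end_s * fps), frames_per_segment)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        segments.append(frames)
    cap.release()
    return segments


def sample_video_frames(video_path, num_frames=32):
    """EVC / guideline summarization: uniformly sample `num_frames` frames
    across the entire video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in uniform_frame_indices(0, total, num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```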