YTCommentQA: Video Question Answerability in Instructional Videos

Authors: Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning.
Researcher Affiliation | Collaboration | Saelyne Yang (KAIST), Sunghyun Park (LG AI Research), Yunseok Jang (University of Michigan), Moontae Lee (LG AI Research, University of Illinois Chicago)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. Figure 2 shows an "Annotation workflow," which is a flowchart, not pseudocode.
Open Source Code | Yes | The dataset is available at https://github.com/lgresearch/YTCommentQA.
Open Datasets | Yes | The dataset is available at https://github.com/lgresearch/YTCommentQA.
Dataset Splits | Yes | We divided the collected data into training and evaluation sets in a 1:1 ratio to fine-tune the models.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions several software components, including a "BERT-based model (Nurmanbetov 2021)", the "YouTube Data API (Google 2023)", a "language detection library (Shuyo 2014)", a "question detection module (Khan 2021)", "Tesseract OCR (Tesseract OCR 2023)", "Llama2 (Touvron et al. 2023)", and the "Flash Attention algorithm (Dao et al. 2022)", but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We divided the collected data into training and evaluation sets in a 1:1 ratio to fine-tune the models. We addressed the class imbalance by augmenting unanswerable questions. The prompts used for fine-tuning are outlined in Appendix B and the training details in Appendix E. We divided the video into segments corresponding to five transcript sentences and then generated summaries for each segment using ChatGPT. We oversampled the Unanswerable, Visual Answerable, and Combined Answerable classes by a factor of two to address the class imbalance in the training set. For Llama2, we extended its context window using rotary positional embeddings (Su et al. 2022) and the Flash Attention algorithm (Dao et al. 2022) to incorporate the longer text inputs. For SeViLA, we processed 768 tokens from the transcript, truncating sequences that exceeded the limit.
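
The Experiment Setup row above describes several concrete data-preparation steps: a 1:1 train/evaluation split, two-fold oversampling of the Unanswerable, Visual Answerable, and Combined Answerable classes, five-sentence transcript segments for ChatGPT summarization, and a 768-token transcript limit for SeViLA. The snippet below is a minimal Python sketch of those steps under stated assumptions; the function names, label strings, and data layout are hypothetical illustrations and are not taken from the authors' released code.

    # Minimal, hypothetical sketch of the data-preparation steps quoted above.
    # Label strings and the example format are assumptions, not the authors' code.
    import random

    MINORITY_LABELS = {"Unanswerable", "Visual Answerable", "Combined Answerable"}

    def split_half(examples, seed=0):
        """Shuffle and split examples into training and evaluation sets in a 1:1 ratio."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        mid = len(examples) // 2
        return examples[:mid], examples[mid:]

    def oversample_minority(train_set):
        """Duplicate the three minority answerability classes by a factor of two."""
        extra = [ex for ex in train_set if ex["label"] in MINORITY_LABELS]
        return train_set + extra

    def segment_transcript(sentences, size=5):
        """Group transcript sentences into five-sentence segments prior to summarization."""
        return [sentences[i:i + size] for i in range(0, len(sentences), size)]

    def truncate_tokens(tokens, limit=768):
        """Keep at most `limit` transcript tokens, as described for the SeViLA input."""
        return tokens[:limit]

    if __name__ == "__main__":
        labels = ["Script Answerable", "Unanswerable", "Visual Answerable", "Combined Answerable"]
        data = [{"id": i, "label": random.choice(labels)} for i in range(20)]
        train, evaluation = split_half(data)
        train = oversample_minority(train)
        print(len(train), len(evaluation))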