YTCommentQA: Video Question Answerability in Instructional Videos

Authors: Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning.
Researcher Affiliation | Collaboration | Saelyne Yang (KAIST), Sunghyun Park (LG AI Research), Yunseok Jang (University of Michigan), Moontae Lee (LG AI Research, University of Illinois Chicago)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. Figure 2 shows an "Annotation workflow," which is a flowchart, not pseudocode.
Open Source Code | Yes | The dataset is available at https://github.com/lgresearch/YTCommentQA.
Open Datasets | Yes | The dataset is available at https://github.com/lgresearch/YTCommentQA.
Dataset Splits | Yes | We divided the collected data into training and evaluation sets in a 1:1 ratio to fine-tune the models.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions several software components, including a "BERT-based model (Nurmanbetov 2021)", the "YouTube Data API (Google 2023)", a "language detection library (Shuyo 2014)", a "question detection module (Khan 2021)", "Tesseract OCR (Tesseract OCR 2023)", "Llama2 (Touvron et al. 2023)", and the "Flash Attention algorithm (Dao et al. 2022)", but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We divided the collected data into training and evaluation sets in a 1:1 ratio to fine-tune the models. We addressed the class imbalance by augmenting unanswerable questions. The prompts used for fine-tuning are outlined in Appendix B and the training details in Appendix E. We divided the video into segments corresponding to five transcript sentences and then generated summaries for each segment using ChatGPT. We oversampled the Unanswerable, Visual Answerable, and Combined Answerable classes by a factor of two to address the class imbalance in the training set. For Llama2, we extended its context window using rotary positional embeddings (Su et al. 2022) and the Flash Attention algorithm (Dao et al. 2022) to incorporate the longer text inputs. For SeViLA, we processed 768 tokens from the transcript, truncating sequences that exceeded the limit.
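
The Experiment Setup row above describes several concrete data-preparation steps: a 1:1 train/evaluation split, two-fold oversampling of the Unanswerable, Visual Answerable, and Combined Answerable classes, five-sentence transcript segments for ChatGPT summarization, and a 768-token transcript limit for SeViLA. The snippet below is a minimal Python sketch of those steps under stated assumptions; the function names, label strings, and data layout are hypothetical illustrations and are not taken from the authors' released code.

    # Minimal, hypothetical sketch of the data-preparation steps quoted above.
    # Label strings and the example format are assumptions, not the authors' code.
    import random

    MINORITY_LABELS = {"Unanswerable", "Visual Answerable", "Combined Answerable"}

    def split_half(examples, seed=0):
        """Shuffle and split examples into training and evaluation sets in a 1:1 ratio."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        mid = len(examples) // 2
        return examples[:mid], examples[mid:]

    def oversample_minority(train_set):
        """Duplicate the three minority answerability classes by a factor of two."""
        extra = [ex for ex in train_set if ex["label"] in MINORITY_LABELS]
        return train_set + extra

    def segment_transcript(sentences, size=5):
        """Group transcript sentences into five-sentence segments prior to summarization."""
        return [sentences[i:i + size] for i in range(0, len(sentences), size)]

    def truncate_tokens(tokens, limit=768):
        """Keep at most `limit` transcript tokens, as described for the SeViLA input."""
        return tokens[:limit]

    if __name__ == "__main__":
        labels = ["Script Answerable", "Unanswerable", "Visual Answerable", "Combined Answerable"]
        data = [{"id": i, "label": random.choice(labels)} for i in range(20)]
        train, evaluation = split_half(data)
        train = oversample_minority(train)
        print(len(train), len(evaluation))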