YTCommentQA: Video Question Answerability in Instructional Videos
Authors: Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. |
| Researcher Affiliation | Collaboration | Saelyne Yang (KAIST), Sunghyun Park (LG AI Research), Yunseok Jang (University of Michigan), Moontae Lee (LG AI Research, University of Illinois Chicago) |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. Figure 2 shows an "Annotation workflow," which is a flowchart, not pseudocode. |
| Open Source Code | Yes | The dataset is available at https://github.com/lgresearch/YTCommentQA. |
| Open Datasets | Yes | The dataset is available at https://github.com/lgresearch/YTCommentQA. |
| Dataset Splits | Yes | We divided the collected data into training and evaluation sets in a 1:1 ratio to fine-tune the models. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components like "BERT-based model (Nurmanbetov 2021)", "YouTube Data API (Google 2023)", "language detection library (Shuyo 2014)", "question detection module (Khan 2021)", "Tesseract OCR (Tesseract OCR 2023)", "Llama2 (Touvron et al. 2023)", and "Flash Attention algorithm (Dao et al. 2022)", but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We divided the collected data into training and evaluation sets in a 1:1 ratio to fine-tune the models. We addressed the class imbalance by augmenting unanswerable questions. The prompts used for fine-tuning are outlined in Appendix B and training details in Appendix E. We divided the video into segments corresponding to five transcript sentences. We then generated summaries for each segment using ChatGPT. We oversampled the Unanswerable, Visual Answerable, and Combined Answerable classes by a factor of two to address the class imbalance in the training set. For Llama2, we extended its context window using rotary positional embeddings (Su et al. 2022) and the Flash Attention algorithm (Dao et al. 2022) to incorporate the longer text inputs. For SeViLA, we processed 768 tokens from the transcript, truncating sequences that exceeded the limit. (Sketches of the split/oversampling and transcript-segmentation steps appear below the table.) |
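
The split and class-balancing steps quoted above (a 1:1 train/evaluation split, then oversampling the Unanswerable, Visual Answerable, and Combined Answerable classes by a factor of two) can be illustrated with a minimal sketch. This is not the authors' released code; the record structure and label strings here are assumptions for illustration only.

```python
import random

# Hypothetical example records: each item pairs a viewer question with one of
# the four answerability labels described in the paper.
data = [
    {"question": q, "label": lbl}
    for q, lbl in [
        ("How long do I knead the dough?", "Script Answerable"),
        ("What color is the mixing bowl?", "Visual Answerable"),
        ("Does this work with a different brand?", "Unanswerable"),
        ("Which knob does she turn while explaining?", "Combined Answerable"),
    ] * 50
]

random.seed(0)
random.shuffle(data)

# 1:1 split into training and evaluation sets.
mid = len(data) // 2
train, evaluation = data[:mid], data[mid:]

# Oversample the minority classes in the training set by a factor of two,
# i.e. duplicate each minority-class example once.
minority = {"Unanswerable", "Visual Answerable", "Combined Answerable"}
train += [ex for ex in train if ex["label"] in minority]

print(len(train), len(evaluation))
```

The sketch duplicates minority-class examples rather than generating new ones; the paper additionally augments unanswerable questions, which is not reproduced here.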
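
The paper's segmentation of each video into chunks of five transcript sentences, each then summarized with ChatGPT, could look roughly like the following sketch. The `segment_transcript` helper and the stand-in transcript are hypothetical; the actual prompts and summarization pipeline are described in the paper's appendices.

```python
def segment_transcript(sentences, size=5):
    """Group transcript sentences into consecutive segments of `size` sentences."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

# Stand-in transcript; in practice these would be the video's transcript sentences.
transcript = [f"Sentence {i}." for i in range(1, 13)]

for segment in segment_transcript(transcript):
    segment_text = " ".join(segment)
    # Each segment_text would then be sent to a summarization model
    # (ChatGPT in the paper) to produce a per-segment summary.
    print(segment_text)
```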