Self-Chained Image-Language Model for Video Localization and Question Answering

Authors: Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of the SeViLA framework on five challenging video question answering and event prediction benchmarks (NExT-QA, STAR, How2QA, TVQA, and VLEP) [75, 77, 36, 27, 28], where SeViLA outperforms several strong baselines/previous works, and achieves the state-of-the-art in both fine-tuning (NExT-QA and STAR) and zero-shot (NExT-QA, STAR, How2QA, and VLEP) settings.
Researcher Affiliation | Academia | UNC Chapel Hill {shoubin, jmincho, praty, mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/Yui010206/SeViLA
Open Datasets | Yes | Benchmarks. We evaluate our SeViLA framework on 3 video-language tasks, including multi-choice Video Question Answering (NExT-QA [77], STAR [75], How2QA [36], TVQA [27]), Video Event Prediction (VLEP [28]), and Moment Retrieval (QVHighlights [30]).
Dataset Splits | Yes | For NExT-QA, STAR, How2QA, TVQA, and VLEP we report the performance on the validation set, whereas for QVHighlights we report on the hidden test set. (These benchmark/split pairs are collected in the sketch after this table.)
Hardware Specification | Yes | We conduct experiments with 4 48GB A6000 GPUs. For Localizer pre-training, we pre-train the Localizer on QVHighlights for 80 epochs, taking approximately 12 hours and about 29GB of memory on 4 A6000 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, Huggingface Transformers, and Torchvision but does not specify their version numbers.
Experiment Setup | Yes | We report the SeViLA framework training hyperparameters for Localizer pre-training, Answerer fine-tuning, and Localizer self-refinement in Table 11.
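
The benchmark and split information quoted in the table can be summarized programmatically for anyone scripting a re-evaluation. Below is a minimal sketch, assuming a plain Python mapping; the dictionary name, task labels, and split strings are illustrative and are not identifiers from the paper or the SeViLA repository.

```python
# Minimal sketch of the evaluation setup quoted above.
# EVAL_BENCHMARKS, "task", and "split" are illustrative names,
# not identifiers from the paper or the SeViLA codebase.

EVAL_BENCHMARKS = {
    # Multi-choice Video Question Answering -- reported on validation sets
    "NExT-QA":      {"task": "video_qa",         "split": "val"},
    "STAR":         {"task": "video_qa",         "split": "val"},
    "How2QA":       {"task": "video_qa",         "split": "val"},
    "TVQA":         {"task": "video_qa",         "split": "val"},
    # Video Event Prediction -- reported on the validation set
    "VLEP":         {"task": "event_prediction", "split": "val"},
    # Moment Retrieval -- reported on the hidden test set
    "QVHighlights": {"task": "moment_retrieval", "split": "test (hidden)"},
}

if __name__ == "__main__":
    for name, info in EVAL_BENCHMARKS.items():
        print(f"{name}: {info['task']} evaluated on the {info['split']} split")
```

Such a mapping makes it easy to cross-check that reported numbers for NExT-QA, STAR, How2QA, TVQA, and VLEP come from validation sets, while QVHighlights numbers come from the hidden test set.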