Self-Chained Image-Language Model for Video Localization and Question Answering

Authors: Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of the SeViLA framework on five challenging video question answering and event prediction benchmarks (NExT-QA, STAR, How2QA, TVQA, and VLEP) [75, 77, 36, 27, 28], where SeViLA outperforms several strong baselines/previous works, and achieves the state-of-the-art in both fine-tuning (NExT-QA and STAR) and zero-shot (NExT-QA, STAR, How2QA, and VLEP) settings.
Researcher Affiliation | Academia | UNC Chapel Hill {shoubin, jmincho, praty, mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/Yui010206/SeViLA
Open Datasets | Yes | Benchmarks. We evaluate our SeViLA framework on 3 video-language tasks, including multi-choice Video Question Answering (NExT-QA [77], STAR [75], How2QA [36], TVQA [27]), Video Event Prediction (VLEP [28]), and Moment Retrieval (QVHighlights [30]).
Dataset Splits | Yes | For NExT-QA, STAR, How2QA, TVQA, and VLEP we report the performance on the validation set, whereas for QVHighlights we report on the hidden test set. (These benchmark/split pairs are collected in the sketch after this table.)
Hardware Specification | Yes | We conduct experiments with 4 48GB A6000 GPUs. For Localizer pre-training, we pre-train the Localizer on QVHighlights for 80 epochs, taking approximately 12 hours and about 29GB of memory on 4 A6000 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, Huggingface Transformers, and Torchvision but does not specify their version numbers.
Experiment Setup | Yes | We report the SeViLA framework training hyperparameters for Localizer pre-training, Answerer fine-tuning, and Localizer self-refinement in Table 11.
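
The benchmark and split information quoted in the table can be summarized programmatically for anyone scripting a re-evaluation. Below is a minimal sketch, assuming a plain Python mapping; the dictionary name, task labels, and split strings are illustrative and are not identifiers from the paper or the SeViLA repository.

```python
# Minimal sketch of the evaluation setup quoted above.
# EVAL_BENCHMARKS, "task", and "split" are illustrative names,
# not identifiers from the paper or the SeViLA codebase.

EVAL_BENCHMARKS = {
    # Multi-choice Video Question Answering -- reported on validation sets
    "NExT-QA":      {"task": "video_qa",         "split": "val"},
    "STAR":         {"task": "video_qa",         "split": "val"},
    "How2QA":       {"task": "video_qa",         "split": "val"},
    "TVQA":         {"task": "video_qa",         "split": "val"},
    # Video Event Prediction -- reported on the validation set
    "VLEP":         {"task": "event_prediction", "split": "val"},
    # Moment Retrieval -- reported on the hidden test set
    "QVHighlights": {"task": "moment_retrieval", "split": "test (hidden)"},
}

if __name__ == "__main__":
    for name, info in EVAL_BENCHMARKS.items():
        print(f"{name}: {info['task']} evaluated on the {info['split']} split")
```

Such a mapping makes it easy to cross-check that reported numbers for NExT-QA, STAR, How2QA, TVQA, and VLEP come from validation sets, while QVHighlights numbers come from the hidden test set.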