Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Self-Chained Image-Language Model for Video Localization and Question Answering
Authors: Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of SeViLA framework on five challenging video question answering and event prediction benchmarks (NExT-QA, STAR, How2QA, TVQA, and VLEP) [75, 77, 36, 27, 28], where SeViLA outperforms several strong baselines/previous works, and achieves the state-of-the-art in both fine-tuning (NExT-QA and STAR) and zero-shot (NExT-QA, STAR, How2QA, and VLEP) settings. |
| Researcher Affiliation | Academia | UNC Chapel Hill EMAIL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/Yui010206/SeViLA |
| Open Datasets | Yes | Benchmarks. We evaluate our SeViLA framework on 3 video-language tasks, including multi-choice Video Question Answering (NExT-QA [77], STAR [75], How2QA [36], TVQA [27]), Video Event Prediction (VLEP [28]), and Moment Retrieval (QVHighlights [30]). |
| Dataset Splits | Yes | For NExT-QA, STAR, How2QA, TVQA, and VLEP we report the performance on the validation set whereas QVHighlights we report on the hidden test set. |
| Hardware Specification | Yes | We conduct experiments with 4 × 48GB A6000 GPUs. For Localizer pre-training, we pre-train Localizer on the QVHighlights for 80 epochs, taking approximately 12 hours with 4 × 29GB on A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, Huggingface Transformers, and Torchvision but does not specify their version numbers. |
| Experiment Setup | Yes | We report SeViLA framework training hyperparameters in Localizer pre-training, Answerer fine-tuning, and Localizer self-refinement, in Table 11. |