Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR

Authors: Shenghui Chen, Po-han Li, Sandeep Chinchali, Ufuk Topcu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance, boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.
Researcher Affiliation Academia Shenghui Chen, Po-han Li, Sandeep Chinchali, Ufuk Topcu; The University of Texas at Austin; EMAIL
Pseudocode No The paper describes methods using mathematical formulations and diagrams (e.g., Figure 2 and equations 1-5), but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Project Website, Code, and LearningPaper24 Dataset. Refer to the supplemental material for code.
Open Datasets Yes To evaluate VIBE, we conduct between-subjects user studies with 243 participants across three datasets, LearningPaper24 (self-curated), LongVideoBench [10], and SUTD-TrafficQA [11], measuring human performance in terms of accuracy, response time, and inverse efficiency score, the ratio of response time to accuracy [12].
Dataset Splits Yes LearningPaper24: We randomly sample 80 papers each from ICLR 2024 and NeurIPS 2024 following the curation process in Appendix C. For each presentation, we generate 5 VLM responses and 1 CoT response. SUTD-TrafficQA: We randomly sample 100 video clips from the dataset. For each video clip, we generate 5 VLM responses and 1 CoT response. LongVideoBench: We sample 150 video clips, each 30 to 500 seconds long, from categories ["E2O", "E2E", "O3O", "S2A", "S2E", "S2O", "SOS"]. For each clip, we generate 5 VLM responses and 1 CoT response. For each dataset, 10 video stimuli with multiple-choice questions are shown in randomized order.
Hardware Specification Yes All experiments were conducted on four NVIDIA RTX 6000 Ada GPUs (48GB VRAM) using the vLLM backend. The system was equipped with an Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz, featuring 64 cores (x86_64, 64-bit).
Software Dependencies No The paper mentions using the 'vLLM backend', 'EasyOCR [41]', and 'OpenAI API [39]' for experiments. However, it does not provide specific version numbers for any of these software components or other libraries, which are necessary for full reproducibility.
Experiment Setup Yes Here, α, β are hyperparameters controlling the trade-off between grounding and utility in eq. (5). A higher α emphasizes alignment with video, favoring summaries that faithfully reflect the video, while a higher β prioritizes task relevance, selecting summaries that improve downstream performance. ...we perform a convex combination sweep over α ∈ {0, 0.05, 0.1, ..., 1.0}, with β = 1 − α. ...k = 5 in this study, and the k responses are generated with various temperatures. ...In all datasets, we mask text by removing keywords with high tf-idf scores.
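
The convex combination sweep quoted under Experiment Setup can be illustrated with a minimal sketch. Eq. (5) is not reproduced on this page, so the weighted-sum objective below is an assumption, and the function name `sweep_alpha` and the grounding and utility values are hypothetical placeholders rather than anything taken from the paper or its code.

```python
def sweep_alpha(grounding, utility, step=0.05):
    """For each alpha in {0, step, ..., 1.0}, pick the candidate summary
    that maximizes alpha * grounding + (1 - alpha) * utility.

    ASSUMPTION: the weighted sum stands in for eq. (5), which is not
    shown here. Returns a dict mapping alpha to the selected index.
    """
    n_steps = round(1.0 / step)
    selections = {}
    for i in range(n_steps + 1):
        alpha = round(i * step, 2)   # avoid floating-point drift in the grid
        beta = 1.0 - alpha           # convex combination: alpha + beta = 1
        scores = [alpha * g + beta * u for g, u in zip(grounding, utility)]
        selections[alpha] = max(range(len(scores)), key=scores.__getitem__)
    return selections


# Hypothetical scores for k = 5 candidate summaries of one video.
grounding = [0.9, 0.4, 0.6, 0.2, 0.5]
utility = [0.1, 0.8, 0.6, 0.9, 0.3]
selections = sweep_alpha(grounding, utility)
```

At α = 0 the selection depends only on utility, and at α = 1 only on grounding; intermediate values trade the two off.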
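
The inverse efficiency score mentioned under Open Datasets is described there as the ratio of response time to accuracy. A minimal sketch of that ratio follows; the function name, the units, and the example numbers are illustrative assumptions, not values from the paper.

```python
def inverse_efficiency_score(response_time, accuracy):
    """Ratio of response time to accuracy; lower is better.

    ASSUMPTION: response_time is a mean time (e.g. in seconds) and
    accuracy is a proportion in (0, 1]; the paper's exact units and
    aggregation are not stated on this page.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be a proportion in (0, 1]")
    return response_time / accuracy


# Hypothetical participant: 30 s mean response time at 75% accuracy.
ies = inverse_efficiency_score(30.0, 0.75)
```

Because accuracy is in the denominator, the score penalizes fast but inaccurate responses as well as slow but accurate ones.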