Slot-VLM: Object-Event Slots for Video-Language Modeling

Authors: Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves state-of-the-art performance on video question-answering.
Researcher Affiliation | Collaboration | Jiaqi Xu (1), Cuiling Lan (2), Wenxuan Xie (2), Xuejin Chen (1), Yan Lu (2). (1) University of Science and Technology of China; (2) Microsoft Research Asia. xujiaqi@mail.ustc.edu.cn, {culan,wenxie,yanlu}@microsoft.com, xjchen99@ustc.edu.cn
Pseudocode | No | The paper describes its methodology in natural language and with diagrams, but does not include any formal pseudocode or algorithm blocks. (A generic slot-attention sketch follows this table.)
Open Source Code | No | This paper is the result of an open source research project starting from October, 2023. [...] We will release the code.
Open Datasets | Yes | We use the Video Instruction Data, collected by [28], for video instruction tuning. [...] We evaluate the performance on three open-ended video question-answering (QA) benchmarks: MSVD-QA [9], MSRVTT-QA [44], and ActivityNet-QA [7].
Dataset Splits | No | The paper mentions instruction tuning and evaluation on a test set, but does not explicitly describe a separate validation split with specific percentages or counts.
Hardware Specification | Yes | All models are trained using a single NVIDIA A100 80GB GPU.
Software Dependencies | No | The paper mentions AdamW as the optimizer but does not specify programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup | Yes | In our experiments, we set N_o and N_e to 8 by default unless otherwise specified. [...] The linear projection layers S-Proj., F-Proj., and Proj. consist of 1024, 1024, and 4096 neurons, respectively. [...] We train the models for 60 epochs with a learning rate of 1e-4. [...] We set the learning rate to 2e-5. We adopt the cosine annealing learning rate. We set the batch size to 40 and train on a single A100 GPU. (These settings are assembled into a runnable configuration sketch below.)
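
Since the paper provides no pseudocode, the following PyTorch module sketches the generic slot-attention mechanism (Locatello et al., 2020) that object/event slot designs of this kind commonly build on. This is a hedged illustration, not the authors' unreleased implementation: the slot dimension of 1024 (matching S-Proj.), the three refinement iterations, and the module structure are all assumptions.

```python
# Minimal, generic slot-attention sketch (after Locatello et al., 2020).
# NOT the authors' code; dimensions and iteration count are assumptions.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots=8, dim=1024, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots = num_slots   # paper default: N_o = N_e = 8
        self.iters = iters
        self.eps = eps
        self.scale = dim ** -0.5

        # Learned Gaussian from which initial slots are sampled.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))

        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):
        # inputs: (batch, num_tokens, dim) visual tokens from a frozen encoder.
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)

        # Sample initial slots from the learned Gaussian.
        mu = self.slots_mu.expand(b, self.num_slots, -1)
        sigma = self.slots_logsigma.exp().expand(b, self.num_slots, -1)
        slots = mu + sigma * torch.randn_like(mu)

        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            attn = torch.einsum('bsd,btd->bst', q, k) * self.scale
            # Softmax over the SLOT axis, the key difference from standard
            # cross-attention: slots compete for the input tokens.
            attn = attn.softmax(dim=1) + self.eps
            attn = attn / attn.sum(dim=-1, keepdim=True)
            updates = torch.einsum('bst,btd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots_prev.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (batch, num_slots, dim) compact slot tokens
```

With the paper's default of 8 slots and 1024-dimensional tokens, `SlotAttention()(torch.randn(2, 256, 1024))` returns a `(2, 8, 1024)` tensor, i.e., 256 visual tokens are condensed into 8 slot embeddings per sample.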
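
The Experiment Setup row pins down layer widths and optimization hyperparameters; the sketch below assembles them into a runnable PyTorch configuration. The grouping of the two learning rates into two training phases, the use of AdamW in both, the 1024-dimensional input widths, and the variable names (`s_proj`, `f_proj`, `proj`) are assumptions inferred from the quoted text, not details confirmed by the paper.

```python
# Hedged configuration sketch assembled from the reported hyperparameters.
import torch
from torch import nn, optim

# Projection widths as reported: S-Proj. and F-Proj. have 1024 neurons,
# Proj. has 4096 (the 1024-dim inputs are assumed, matching the slot width).
s_proj = nn.Linear(1024, 1024)
f_proj = nn.Linear(1024, 1024)
proj = nn.Linear(1024, 4096)

params = (list(s_proj.parameters()) + list(f_proj.parameters())
          + list(proj.parameters()))

# Phase 1 (assumed to be alignment pretraining): 60 epochs at lr 1e-4.
opt_phase1 = optim.AdamW(params, lr=1e-4)

# Phase 2 (assumed to be video instruction tuning): lr 2e-5 with cosine
# annealing, batch size 40 on a single A100. T_max=60 is an assumption;
# the quoted text does not state the phase-2 epoch count.
opt_phase2 = optim.AdamW(params, lr=2e-5)
sched_phase2 = optim.lr_scheduler.CosineAnnealingLR(opt_phase2, T_max=60)

loader_kwargs = dict(batch_size=40, shuffle=True)  # reported batch size
```

The missing pieces a full reproduction would still need, as the Software Dependencies and Dataset Splits rows note, are library versions and the exact train/validation partitioning.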