Slot-VLM: Object-Event Slots for Video-Language Modeling
Authors: Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves the state-of-the-art performance on video question-answering. |
| Researcher Affiliation | Collaboration | Jiaqi Xu1, Cuiling Lan2, Wenxuan Xie2, Xuejin Chen1, Yan Lu2; 1University of Science and Technology of China, 2Microsoft Research Asia; xujiaqi@mail.ustc.edu.cn, {culan,wenxie,yanlu}@microsoft.com, xjchen99@ustc.edu.cn |
| Pseudocode | No | The paper describes its methodology in natural language and with diagrams, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | This paper is the result of an open source research project starting from October, 2023. [...] We will release the code. |
| Open Datasets | Yes | We use the Video Instruction Data, collected by [28], for video instruction tuning. [...] We evaluate the performance on three open-ended video question-answering (QA) benchmarks: MSVD-QA [9], MSRVTT-QA [44], and ActivityNet-QA [7]. |
| Dataset Splits | No | The paper mentions "instruction tuning" and evaluation on "test set" but does not explicitly describe a separate validation dataset split with specific percentages or counts. |
| Hardware Specification | Yes | All models are trained using a single NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions "AdamW" as an optimizer but does not specify programming languages, libraries, or other software dependencies with version numbers. |
| Experiment Setup | Yes | In our experiments, we set N_o and N_e to 8 by default unless otherwise specified. [...] The linear projection layers S-Proj., F-Proj., and Proj. consist of 1024, 1024, and 4096 neurons, respectively. [...] We train the models for 60 epochs with a learning rate of 1e-4. [...] We set the learning rate to 2e-5. We adopt the cosine annealing learning rate. We set the batch size to 40 and train on a single A100 GPU. |
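The training schedule quoted above (a base learning rate decayed with cosine annealing over 60 epochs) can be sketched in plain Python. The helper name, the `min_lr=0.0` default, and the per-epoch granularity are assumptions for illustration, not details taken from the paper.

```python
import math

# Hyperparameters quoted in the Experiment Setup row.
BASE_LR_SLOT = 1e-4   # slot-module training (60 epochs)
BASE_LR_TUNE = 2e-5   # instruction tuning
EPOCHS = 60

def cosine_annealed_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Standard cosine-annealing schedule: base_lr at epoch 0,
    min_lr at the final epoch (min_lr=0.0 is an assumption)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

# Learning rate at the start, midpoint, and end of slot-module training.
schedule = [cosine_annealed_lr(BASE_LR_SLOT, e, EPOCHS) for e in range(EPOCHS + 1)]
print(schedule[0], schedule[EPOCHS // 2], schedule[EPOCHS])
```

Under this schedule the rate starts at the full 1e-4, passes through half that value at epoch 30, and decays to zero by epoch 60; the same shape would apply to the 2e-5 instruction-tuning rate.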