Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding
Authors: Yinuo Jing, Ruxu Zhang, Kongming Liang, Yongxiang Li, Zhongjiang He, Zhanyu Ma, Jun Guo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated 8 current multimodal video models on our benchmark and found considerable room for improvement. We hope our work provides insights for the community and opens up new avenues for research in multimodal video models. Our data and code will be released at https://github.com/PRIS-CV/Animal-Bench. |
| Researcher Affiliation | Collaboration | 1 School of Artificial Intelligence, Beijing University of Posts and Telecommunications; 2 China Telecom Artificial Intelligence Technology Co., Ltd |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our data and code will be released at https://github.com/PRIS-CV/Animal-Bench. |
| Open Datasets | Yes | In our paper, we use data from six datasets: Animal-Kingdom [37], MammalNet [38], LoTE-Animal [58], MSRVTT-QA [42], NExT-QA [40], and TGIF-QA [57]. We appreciate the contributions of the aforementioned works, all of which have been cited in the main article. Specifically: MammalNet is licensed under the CC BY license. LoTE-Animal is licensed under the Creative Commons Attribution-Share Alike 4.0 International License. MSRVTT-QA is licensed under the MIT license. NExT-QA is licensed under the MIT license. For the Animal-Kingdom dataset, we have contacted the authors by filling out a questionnaire regarding the dataset’s use and have obtained an official download link. Additionally, we have emailed the authors about our use of the dataset in our paper. The TGIF-QA dataset is explicitly stated on its GitHub page “to be free to use for academic purposes only.” |
| Dataset Splits | No | The paper presents Animal-Bench as an evaluation benchmark, i.e., a test set only. It does not provide training/validation splits for training a new model, as its focus is on evaluating existing models. |
| Hardware Specification | Yes | We conducted all tests for multimodal video models on an NVIDIA RTX 4090 with 24GB of VRAM. |
| Software Dependencies | Yes | During video editing, we utilize Stable Diffusion-inpainting to expand the scenes beyond the captured footage and subsequently apply Stable Diffusion-v1.5 for noise addition in frame blending. (A hedged code sketch of this editing pipeline follows the table.) |
| Experiment Setup | Yes | To ensure fair comparisons, we standardize the 7B LLM backend versions used across all multimodal video models tested during inference, thereby minimizing discrepancies in language proficiency due to differences in model sizes. Following the methodology outlined in [3], we establish a uniform system prompt and adopt the prompt-based model output matching strategy. All generated outputs successfully match the corresponding options. For each video, we sample 16 frames and resize them to (224, 224). (A frame-sampling and option-matching sketch also follows the table.) |
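
The video-editing step quoted in the Software Dependencies row can be approximated with the Hugging Face diffusers library. The sketch below is an illustration under stated assumptions, not the authors' released code: the model identifiers (`runwayml/stable-diffusion-inpainting`, `runwayml/stable-diffusion-v1-5`), the padding-based outpainting mask, the prompts, and the img2img strength used to stand in for "noise addition in frame blending" are all hypothetical.

```python
# Hypothetical sketch: scene expansion via SD inpainting, then SD v1.5 img2img
# as a stand-in for "noise addition in frame blending". Model IDs, mask
# construction, prompts, and strength values are assumptions.
import torch
from PIL import Image, ImageOps
from diffusers import (
    StableDiffusionInpaintPipeline,
    StableDiffusionImg2ImgPipeline,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=dtype
).to(device)
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)


def expand_scene(frame: Image.Image, pad: int = 128,
                 prompt: str = "natural habitat background") -> Image.Image:
    """Outpaint beyond the captured footage: pad the frame and inpaint the border."""
    padded = ImageOps.expand(frame, border=pad, fill=(127, 127, 127)).resize((512, 512))
    # White = region to synthesize (the padded border), black = keep original content.
    mask = Image.new("L", frame.size, 0)
    mask = ImageOps.expand(mask, border=pad, fill=255).resize((512, 512))
    return inpaint(prompt=prompt, image=padded, mask_image=mask).images[0]


def blend_with_noise(frame: Image.Image, prompt: str, strength: float = 0.4) -> Image.Image:
    """Add diffusion noise and re-denoise the frame (img2img) to blend edits smoothly."""
    return img2img(prompt=prompt, image=frame.resize((512, 512)), strength=strength).images[0]
```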
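
Similarly, the inference-side preprocessing quoted in the Experiment Setup row (16 uniformly sampled frames resized to 224x224, plus prompt-based matching of model outputs to answer options) could look like the following sketch. The decord video reader and the substring-based matching heuristic are assumptions; the paper follows the matching strategy of [3], which is not reproduced exactly here.

```python
# Minimal sketch of the evaluation-side preprocessing: uniformly sample 16
# frames, resize to 224x224, and match a model's free-form answer back to a
# candidate option. The matching heuristic is an illustrative assumption.
import numpy as np
from decord import VideoReader, cpu
from PIL import Image


def sample_frames(video_path: str, num_frames: int = 16, size: int = 224) -> list[Image.Image]:
    """Uniformly sample num_frames frames and resize each to (size, size)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(idx).asnumpy()  # shape: (num_frames, H, W, 3)
    return [Image.fromarray(f).resize((size, size)) for f in frames]


def match_option(model_output: str, options: list[str]) -> int:
    """Return the index of the option whose text or letter label appears in the output."""
    out = model_output.strip().lower()
    for i, opt in enumerate(options):
        letter = chr(ord("a") + i)
        if opt.lower() in out or out.startswith(f"({letter})") or out.startswith(letter + "."):
            return i
    return -1  # no match found
```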