Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding

Authors: Yinuo Jing, Ruxu Zhang, Kongming Liang, Yongxiang Li, Zhongjiang He, Zhanyu Ma, Jun Guo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated 8 current multimodal video models on our benchmark and found considerable room for improvement. We hope our work provides insights for the community and opens up new avenues for research in multimodal video models. Our data and code will be released at https://github.com/PRIS-CV/Animal-Bench.
Researcher Affiliation | Collaboration | School of Artificial Intelligence, Beijing University of Posts and Telecommunications; China Telecom Artificial Intelligence Technology Co., Ltd.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our data and code will be released at https://github.com/PRIS-CV/Animal-Bench.
Open Datasets | Yes | In our paper, we use data from six datasets: Animal-Kingdom [37], MammalNet [38], LoTE-Animal [58], MSRVTT-QA [42], NExT-QA [40], and TGIF-QA [57]. We appreciate the contributions of the aforementioned works, all of which have been cited in the main article. Specifically: MammalNet is licensed under the CC BY license. LoTE-Animal is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. MSRVTT-QA is licensed under the MIT license. NExT-QA is licensed under the MIT license. For the Animal-Kingdom dataset, we have contacted the authors by filling out a questionnaire regarding the dataset's use and have obtained an official download link. Additionally, we have emailed the authors about our use of the dataset in our paper. The TGIF-QA dataset is explicitly stated on its GitHub page "to be free to use for academic purposes only."
Dataset Splits | No | The paper describes Animal-Bench as an evaluation benchmark, which serves as a test set. It does not provide training/validation splits for training a new model, as its focus is on evaluating existing models.
Hardware Specification | Yes | We conducted all tests for multimodal video models on an NVIDIA RTX 4090 with 24GB of VRAM.
Software Dependencies | Yes | During video editing, we utilize Stable Diffusion inpainting to expand the scenes beyond the captured footage and subsequently apply Stable Diffusion v1.5 for noise addition in frame blending. (A hedged sketch of such a pipeline is given after this table.)
Experiment Setup | Yes | To ensure fair comparisons, we standardize the 7B LLM backend versions used across all multimodal video models tested during inference, thereby minimizing discrepancies in language proficiency due to differences in model sizes. Following the methodology outlined in [3], we establish a uniform system prompt and adopt the prompt-based model output matching strategy. All generated outputs successfully match the corresponding options. For each video, we sample 16 frames and resize them to (224, 224). (A frame-sampling and option-matching sketch is given after this table.)
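
The Software Dependencies row names only the tools used for video editing: Stable Diffusion inpainting for expanding scenes beyond the captured footage and Stable Diffusion v1.5 for noise addition during frame blending. The sketch below shows one plausible way to wire this up with the Hugging Face diffusers pipelines. The checkpoint IDs (runwayml/stable-diffusion-inpainting, runwayml/stable-diffusion-v1-5), the prompts, the 128-pixel padding, and the denoising strength are assumptions for illustration, not values taken from the paper.

```python
# Hypothetical sketch of the video-editing step: outpaint a frame beyond its
# original borders with Stable Diffusion inpainting, then lightly re-diffuse a
# frame with Stable Diffusion v1.5 img2img to add noise for frame blending.
# Checkpoints, prompts, and parameters are assumptions, not the authors' code.
import torch
from PIL import Image, ImageOps
from diffusers import StableDiffusionInpaintPipeline, StableDiffusionImg2ImgPipeline

device = "cuda"  # the paper reports testing on an NVIDIA RTX 4090

inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)


def expand_scene(frame: Image.Image, pad: int = 128,
                 prompt: str = "natural habitat background") -> Image.Image:
    """Outpaint `pad` pixels around the frame; the padded border is the inpainting mask."""
    padded = ImageOps.expand(frame, border=pad, fill=(127, 127, 127))
    mask = Image.new("L", padded.size, 255)  # white = region to synthesize
    mask.paste(0, (pad, pad, pad + frame.width, pad + frame.height))  # keep original pixels
    return inpaint(prompt=prompt,
                   image=padded.resize((512, 512)),
                   mask_image=mask.resize((512, 512))).images[0]


def add_noise(frame: Image.Image, strength: float = 0.3,
              prompt: str = "a wild animal") -> Image.Image:
    """Re-diffuse a frame lightly; a low `strength` preserves most of the original content."""
    return img2img(prompt=prompt, image=frame.resize((512, 512)), strength=strength).images[0]
```

The low img2img strength is a deliberate guess: it injects enough noise to smooth transitions between edited and original frames without replacing the animal itself.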
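
The Experiment Setup row specifies only that 16 frames are sampled per video and resized to 224x224, and that model outputs are matched to answer options with a prompt-based strategy. The sketch below assumes uniform temporal sampling with OpenCV and a simple string-matching helper; these choices, along with the function names sample_frames and match_option, are illustrative rather than the authors' released code.

```python
# Minimal sketch of the evaluation preprocessing: uniformly sample 16 frames
# per video, resize to 224x224, and map free-form model output to an option
# index. The sampling scheme and matching rule are assumptions for illustration.
import cv2
import numpy as np


def sample_frames(video_path: str, num_frames: int = 16, size: int = 224) -> np.ndarray:
    """Return a (num_frames, size, size, 3) RGB array sampled uniformly over the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:  # fall back to a black frame if decoding fails
            frame = np.zeros((size, size, 3), dtype=np.uint8)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    return np.stack(frames)


def match_option(model_output: str, options: list[str]) -> int:
    """Map free-form output to an option index, e.g. '(B)', 'b.', or the option text itself."""
    text = model_output.strip().lower()
    for i, opt in enumerate(options):
        letter = chr(ord("a") + i)
        if text.startswith(f"({letter})") or text.startswith(letter + ".") or opt.lower() in text:
            return i
    return -1  # unmatched (the paper reports that all generated outputs matched an option)
```

For example, sample_frames("clip.mp4") yields a (16, 224, 224, 3) array ready for a model's visual processor, and match_option("(B) climbing", ["running", "climbing"]) returns 1.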