Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Authors: Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di ZHANG, Jufeng Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
Researcher Affiliation | Collaboration | 1 Nankai University; 2 Pengcheng Laboratory; 3 Kuaishou Technology; 4 Nankai International Advanced Research Institute (SHENZHEN FUTIAN)
Pseudocode | No | The paper describes the pipeline and reasoning framework conceptually and visually in Figure 3 but does not contain a formally structured pseudocode or algorithm block.
Open Source Code | Yes | The code and the representative data samples are provided in the supplementary zip file. The full data will be publicly released after acceptance.
Open Datasets | Yes | Moreover, we establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. The full data will be publicly released after acceptance.
Dataset Splits | Yes | All the data used for training are sampled from the training or validation split of the source datasets. All the data used for evaluation are sampled from the testing split of the source datasets.
Hardware Specification | No | The paper mentions using a Vision Transformer (ViT) and specific models like Qwen2.5-VL but does not specify the hardware (e.g., GPU model, CPU) used for running the experiments.
Software Dependencies | No | Following the approach outlined by Qwen2.5-VL [65], we adopt a Vision Transformer (ViT)-based architecture for the visual encoder and utilize an autoregressive model for the text encoder. For the foundational large language model (LLM), we select models within the 3B to 7B parameter range. VidEmo is pre-trained for 3 epochs with a batch size of 1024 and is subsequently post-trained for 1 epoch with a batch size of 128. We employ the AdamW [48] optimizer with a cosine learning rate schedule.
Experiment Setup | Yes | VidEmo is pre-trained for 3 epochs with a batch size of 1024 and is subsequently post-trained for 1 epoch with a batch size of 128. We employ the AdamW [48] optimizer with a cosine learning rate schedule. The learning rate is set to 2e-5 for SFT stage and 1e-5 for RL stage, with a warmup rate of 0.03.
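
As a rough illustration of the Experiment Setup row above, the following Python sketch computes a cosine learning rate schedule with warmup. Only the peak learning rates (2e-5 for SFT, 1e-5 for RL) and the warmup rate of 0.03 come from the quoted text. The linear warmup shape, the decay to zero, the function name `lr_at_step`, and the step count of 1000 are assumptions made for illustration, not details confirmed by the paper.

```python
import math

# Peak learning rates quoted in the Experiment Setup row.
PEAK_LR = {"sft": 2e-5, "rl": 1e-5}
WARMUP_RATIO = 0.03  # "warmup rate of 0.03" from the quoted text


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=WARMUP_RATIO, min_lr=0.0):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumed shape: linear warmup over the first `warmup_ratio` fraction of
    steps, then cosine decay from `peak_lr` down to `min_lr`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count; the paper does not state one
    for stage, peak in PEAK_LR.items():
        samples = [lr_at_step(s, total, peak) for s in (0, 29, 30, 500, 999)]
        print(stage, ["%.2e" % v for v in samples])
```

In practice a training framework's built-in scheduler would normally be used instead; the function here only makes the shape of the schedule explicit.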