Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

Authors: Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In extensive empirical evaluations across diverse video question answering benchmarks, the ISR-DPO significantly outperforms the state of the art. |
| Researcher Affiliation | Academia | ¹Seoul National University, ²Yonsei University, ³University of Minnesota. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the stages of ISR-DPO in Section 3 and Figure 3, but does not present them in a structured pseudocode or algorithm block; the steps are described in narrative text and mathematical equations. |
| Open Source Code | Yes | We are committed to open-sourcing our code, models, and datasets to encourage further investigation. |
| Open Datasets | Yes | Our training dataset utilizes a fixed set of 17k video-instruction ({V, x}) pairs from (Zhang et al. 2024a), in contrast to previous works (Yuan et al. 2024; Chen et al. 2024) that incremented their dataset across iterations. For all iterations beyond the initial VLMM π_θ1, we generate the preference dataset D_t^pref at each iteration by generating new responses and preferences. Following (Maaz et al. 2024; Zhang et al. 2024a), we evaluate our method on two types of video question answering datasets: one that requires concise responses, and the other that demands comprehensive answers, across 7 video collections. |
| Dataset Splits | Yes | Our training dataset utilizes a fixed set of 17k video-instruction ({V, x}) pairs from (Zhang et al. 2024a)... Following (Maaz et al. 2024; Zhang et al. 2024a), we evaluate our method on two types of video question answering datasets: one that requires concise responses, and the other that demands comprehensive answers, across 7 video collections. |
| Hardware Specification | Yes | Training is conducted on 8 NVIDIA A100 GPUs (80G). |
| Software Dependencies | No | The paper mentions several models and methods such as DPO, VLMM, and LLM, and refers to specific model sizes (e.g., a 7B-sized model), but it does not provide version numbers for the software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | We perform full-parameter fine-tuning using DPO with 9 total iterations... All generative processes use specific prompts... We employ a 7B-sized model for fair comparison with others. ... We generate two different responses for the input video V and question x using a high temperature hyper-parameter (e.g., 0.7). |
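The experiment setup above fine-tunes the model with DPO over self-generated preference pairs (two responses sampled at temperature 0.7, with one preferred over the other). As a rough illustration only, not the paper's code, the standard DPO objective (Rafailov et al. 2023) for a single preference pair can be computed from sequence log-probabilities under the trained policy and a frozen reference model; the function and variable names here are hypothetical.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are total sequence log-probabilities under the trained
    policy and the frozen reference model; beta scales the implicit
    reward derived from the policy/reference log-ratio.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls below log(2) when the policy favors the chosen
# response more strongly than the reference does, and rises above
# log(2) in the opposite case.
low = dpo_loss(-10.0, -20.0, -15.0, -15.0)   # policy favors chosen
high = dpo_loss(-20.0, -10.0, -15.0, -15.0)  # policy favors rejected
```

In the iterative setting described above, the reference model at each iteration would be the previous iteration's policy, so the margin is measured against the model's own earlier preferences.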