Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

Authors: Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In extensive empirical evaluations across diverse video question answering benchmarks, the ISR-DPO significantly outperforms the state of the art. |
| Researcher Affiliation | Academia | ¹Seoul National University, ²Yonsei University, ³University of Minnesota. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the stages of ISR-DPO in Section 3 and Figure 3, but does not present them in a structured pseudocode or algorithm block; the steps are described in narrative text and mathematical equations. |
| Open Source Code | Yes | We are committed to open-sourcing our code, models, and datasets to encourage further investigation. |
| Open Datasets | Yes | Our training dataset utilizes a fixed set of 17k video-instruction ({V, x}) pairs from (Zhang et al. 2024a), in contrast to previous works (Yuan et al. 2024; Chen et al. 2024) that incremented their dataset across iterations. For all iterations beyond the initial VLMM π_θ1, we generate the preference dataset D_t^pref at each iteration by generating new responses and preferences. Following (Maaz et al. 2024; Zhang et al. 2024a), we evaluate our method on two types of video question answering datasets: one that requires concise responses, and the other that demands comprehensive answers, across 7 video collections. |
| Dataset Splits | Yes | Our training dataset utilizes a fixed set of 17k video-instruction ({V, x}) pairs from (Zhang et al. 2024a)... Following (Maaz et al. 2024; Zhang et al. 2024a), we evaluate our method on two types of video question answering datasets: one that requires concise responses, and the other that demands comprehensive answers, across 7 video collections. |
| Hardware Specification | Yes | Training is conducted on 8 NVIDIA A100 GPUs (80G). |
| Software Dependencies | No | The paper mentions several models and methods such as DPO, VLMM, and LLM, and refers to specific model sizes (e.g., a 7B-sized model), but it does not provide version numbers for the software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | We perform full-parameter fine-tuning using DPO with 9 total iterations... All generative processes use specific prompts... We employ a 7B-sized model for fair comparison with others. ... We generate two different responses for the input video V and question x using a high temperature hyper-parameter (e.g., 0.7). |
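The experiment setup above fine-tunes the model with DPO over self-generated preference pairs (two responses sampled at temperature 0.7, with one preferred over the other). As a rough illustration only, not the paper's code, the standard DPO objective (Rafailov et al. 2023) for a single preference pair can be computed from sequence log-probabilities under the trained policy and a frozen reference model; the function and variable names here are hypothetical.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are total sequence log-probabilities under the trained
    policy and the frozen reference model; beta scales the implicit
    reward derived from the policy/reference log-ratio.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls below log(2) when the policy favors the chosen
# response more strongly than the reference does, and rises above
# log(2) in the opposite case.
low = dpo_loss(-10.0, -20.0, -15.0, -15.0)   # policy favors chosen
high = dpo_loss(-20.0, -10.0, -15.0, -15.0)  # policy favors rejected
```

In the iterative setting described above, the reference model at each iteration would be the previous iteration's policy, so the margin is measured against the model's own earlier preferences.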