Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO
Authors: Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In extensive empirical evaluations across diverse video question answering benchmarks, the ISR-DPO significantly outperforms the state of the art. |
| Researcher Affiliation | Academia | 1Seoul National University 2Yonsei University 3University of Minnesota EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the stages of ISR-DPO in Section 3 and Figure 3, but it does not present them in a structured pseudocode or algorithm block format. The steps are described in narrative text and mathematical equations. |
| Open Source Code | Yes | We are committed to open-sourcing our code, models, and datasets to encourage further investigation. |
| Open Datasets | Yes | Our training dataset utilizes a fixed set of 17k video-instruction ({V, x}) pairs from (Zhang et al. 2024a), in contrast to previous works (Yuan et al. 2024; Chen et al. 2024) that incremented their dataset across iterations. For all iterations beyond the initial VLMM πθ1, we generate preference dataset Dpref t at each iteration by generating new responses and preferences. Following (Maaz et al. 2024; Zhang et al. 2024a), we evaluate our method on two types of video question answering datasets: one that requires concise responses, and the other that demands comprehensive answers, across 7 video collections. |
| Dataset Splits | Yes | Our training dataset utilizes a fixed set of 17k video-instruction ({V, x}) pairs from (Zhang et al. 2024a)... Following (Maaz et al. 2024; Zhang et al. 2024a), we evaluate our method on two types of video question answering datasets: one that requires concise responses, and the other that demands comprehensive answers, across 7 video collections. |
| Hardware Specification | Yes | Training is conducted on 8 NVIDIA A100 GPUs (80G). |
| Software Dependencies | No | The paper mentions several models and methods like DPO, VLMM, and LLM, and refers to specific model sizes (e.g., 7B-sized model), but does not provide specific version numbers for software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | We perform full-parameter fine-tuning using DPO with 9 total iterations... All generative processes use specific prompts... We employ a 7B-sized model for fair comparison with others. ... We generate two different responses for the input video V and question x using a high temperature hyper-parameter (e.g., 0.7). |
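The preference-data step quoted in the Experiment Setup row (sample two responses for the same video and question at high temperature, then rank them into a chosen/rejected pair for DPO) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` and `judge` are hypothetical stand-ins for the VLMM's decoding and the preference-labeling step, and the toy token sampler only demonstrates how temperature (e.g., 0.7) diversifies the two candidates.

```python
import math
import random

def sample_token(logits, temperature=0.7, rng=random):
    """Sample one token index from logits after temperature scaling.
    Higher temperature flattens the distribution, yielding more diverse samples."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def make_preference_pair(generate, video, question, judge):
    """Sample two candidate answers for the same (video, question) input and
    rank them with `judge` to form one DPO (chosen, rejected) training example."""
    y1 = generate(video, question)
    y2 = generate(video, question)
    chosen, rejected = (y1, y2) if judge(y1) >= judge(y2) else (y2, y1)
    return {"video": video, "question": question,
            "chosen": chosen, "rejected": rejected}
```

Repeating `make_preference_pair` over the fixed 17k video-instruction pairs at each iteration would yield the per-iteration preference dataset the paper denotes Dpref t.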