Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Authors: Pritam Sarkar, Ali Etemad
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning. |
| Researcher Affiliation | Academia | Pritam Sarkar Queen's University, Canada and Vector Institute EMAIL Ali Etemad Queen's University, Canada EMAIL |
| Pseudocode | Yes | A RRPO Pseudocode (PyTorch Style) |
| Open Source Code | Yes | We make our code, data, and model weights public to enable fast and accurate reproducibility. The data and code are shared through an anonymized repository during the review process for reproducibility. |
| Open Datasets | Yes | Based on the availability and diversity of video-language instructions, we use VideoChat-IT [3] as our primary source for training samples. Specifically, we select a subset of VideoChat-IT encompassing eight video datasets: Kinetics700 [37], Something-Something-v2 [38], VideoChat [39], VideoChatGPT [40], CLEVRER [41], NEXTQA [42], EgoQA [43], and TGIF [44]. |
| Dataset Splits | Yes | Evaluation benchmarks. To assess the impact of our self-alignment framework, we conduct evaluations across a diverse range of video understanding tasks. Specifically, we choose TVBench [17] and TempCompass [20] for fine-grained temporal understanding, VideoHallucer [19] and VidHalluc [51] for video hallucination, MVBench [3] and VideoMME [52] for short video understanding, and MLVU [24] and LongVideoBench [25] for long video understanding. |
| Hardware Specification | Yes | We use 4 A100 80GB GPUs for training, with the training time varying between 1 to 10 hours. |
| Software Dependencies | No | The paper implies the use of PyTorch through its pseudocode (import torch), but does not specify version numbers for PyTorch or any other libraries. While it mentions LLM architectures and vision encoders, it doesn't provide specific software versions for them or other dependencies. |
| Experiment Setup | Yes | Table S6: Details of training hyperparameters (values listed in the order VideoChat2 / LLaVA-Video / LongVU). LLM: Mistral / Qwen2 / Qwen2; Vision encoder: UMT / SigLIP / SigLIP+DINOv2; Trainable module: LoRA in LLM and everything else is kept frozen; LoRA setup [50]: rank=128, alpha=256; Learning rate: 2e-5 / 5e-6 / 5e-6; Learning rate scheduler: Cosine / Cosine / Cosine; Optimizer: AdamW / AdamW / AdamW; Weight decay: 0.02 / 0.0 / 0.0; Warmup ratio: 0.03 / 0.03; Epoch: 1 / 1 / 1; Batch size per GPU: 2 / 1 / 1; Batch size (total): 32 / 32 / 32; α (loss coefficient): 0.01 / 0.01 / 0.05; β (loss coefficient): 0.9 / 0.1 / 0.5; Memory optimization: Zero stage 3 [84, 85] / FSDP |
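
The batch-size figures in the Experiment Setup row can be cross-checked against the 4-GPU setup from the Hardware Specification row. The sketch below is illustrative only and is not taken from the paper's code. It assumes that all four GPUs are used as data-parallel workers and that the total batch size of 32 is reached through gradient accumulation, neither of which is stated in the excerpt above. The `TrainConfig` class and the `grad_accum_steps` helper are hypothetical names introduced for this example. The LoRA scaling factor assumes the standard convention of alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for a subset of the Table S6 values."""
    name: str
    learning_rate: float
    per_gpu_batch: int
    total_batch: int
    alpha: float  # loss coefficient alpha from Table S6
    beta: float   # loss coefficient beta from Table S6


NUM_GPUS = 4       # from the Hardware Specification row (4 A100 80GB)
LORA_RANK = 128    # from the LoRA setup entry
LORA_ALPHA = 256   # from the LoRA setup entry

CONFIGS = [
    TrainConfig("VideoChat2", 2e-5, 2, 32, 0.01, 0.9),
    TrainConfig("LLaVA-Video", 5e-6, 1, 32, 0.01, 0.1),
    TrainConfig("LongVU", 5e-6, 1, 32, 0.05, 0.5),
]


def grad_accum_steps(cfg: TrainConfig, num_gpus: int = NUM_GPUS) -> int:
    """Accumulation steps needed to reach the total batch size.

    Assumption: total_batch = per_gpu_batch * num_gpus * accumulation steps.
    """
    samples_per_step = cfg.per_gpu_batch * num_gpus
    if cfg.total_batch % samples_per_step != 0:
        raise ValueError(f"{cfg.name}: total batch is not divisible by {samples_per_step}")
    return cfg.total_batch // samples_per_step


def lora_scaling(rank: int = LORA_RANK, alpha: int = LORA_ALPHA) -> float:
    """Standard LoRA scaling factor (alpha / rank); assumed, not stated in the table."""
    return alpha / rank


if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {grad_accum_steps(cfg)} accumulation steps")
    print(f"LoRA scaling: {lora_scaling()}")
```

Under these assumptions, VideoChat2 would need 4 accumulation steps and the other two models 8 each. If the authors used a different parallelism layout, these numbers would not apply.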