Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unified Reinforcement and Imitation Learning for Vision-Language Models

Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Frank Wang, Yueh-Hua Wu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them. Performance Improvements (%) across Evaluation Benchmarks (Figure 1). Comparing RIL-applied VLMs based on multi large VLMs with diverse open- and closed-source VLMs, under average performance of numerous vision-language evaluation benchmarks (Figure 2).
Researcher Affiliation | Collaboration | Byung-Kwan Lee (NVIDIA, KAIST); Ryo Hachiuma (NVIDIA); Yong Man Ro (KAIST); Yu-Chiang Frank Wang (NVIDIA, National Taiwan University); Yueh-Hua Wu (NVIDIA)
Pseudocode | Yes | Algorithm 1: RL purely with GRPO or Dr.GRPO based on accuracy rewards from LLM-as-a-Judge. Algorithm 2: RIL for VLMs.
Open Source Code | No | Dataset is all open-source, but the code and the model checkpoints will be open once it is accepted.
Open Datasets | Yes | Our dataset integrates both real-world and synthetic sources: COCO-ReM [83], iNaturalist2018 [84], VQA-v2 [85], Super-CLEVR [86], MAVIS [87], Geometry3K [88], SQA [89], AI2D [2], SA-1B [60], LLaVAR [90], VSR [91], TallyQA [92], TabMWP [93], KonIQ [94], InternVL [95]-filtered synthetic knowledge dataset covering politics, math, physics, chemistry, RLAI-F [96], CLEVR-Math [97], SROIE [98], ChartQA [19], DocVQA [99], FigureQA [100], GQA [101], InfoVQA [102], M3CoT [103], MapQA [104], OK-VQA [105], TextVQA [106], WildVision [107], DVQA [108], GeoQA+ [109], GeOS [110], IconQA [111], UniGEO [112], GeomVerse [113], Geo170K [114], MathV360K [115], multimodal Wikipedia knowledge [116], InfoSeek [117], and RAM++ [118]-filtered synthetic data of InfinityMM [30] covering coarse and fine-grained perception, relation, attribute, and logic reasoning.
Dataset Splits | No | For SFT, we utilize the entire 4M-sample dataset, and then we curate a 40K-sample dataset for RIL of VLMs based on log-probability sampling [82] and overlong filtering [36]. The paper mentions using
Hardware Specification | Yes | We train and evaluate RIL, mainly on NVIDIA A100 80GB GPUs. ...Using 256 NVIDIA A100 GPUs, pre-training the discriminator on 1.2M samples... takes approximately 1 to 3 days. The SFT step on the 4M-sample SFT dataset takes around 3 to 5 days. Conducting the RIL loop for the sampled 40K data requires an additional 3 to 5 days using 8 NVIDIA A100 GPUs.
Software Dependencies | No | we utilize vLLM [79] built on PagedAttention. ...we use DeepSpeed engine with ZeRO-3 [80] for 8 GPUs, and we use AdamW optimizer [81]... The paper lists software components like vLLM, PagedAttention, DeepSpeed, ZeRO-3, and AdamW, but does not specify their version numbers.
Experiment Setup | Yes | we use AdamW optimizer [81] and apply a linearly decayed learning rate from 1e-5 to 1e-6 to pre-training discriminator and SFT of student VLMs. In subsequent step... Mimicking step requires static learning rate 1e-6 and µ=1 iteration to train both student VLMs and the discriminator. Note that, when we generate text responses, we generate G=4 responses for each question, by setting temperature to 1.0, top-p to 0.95, top-k to 50, and repetition penalty to 1.05, in order to get diverse text responses. For stable training, we handle large batch sizes by using gradient accumulation with 6 steps. At every step, we use 4 batches per one GPU, leading to total 144 batches. ...the clipping hyperparameter ϵ is consistently set to 0.2.
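
The Experiment Setup row quotes a linearly decayed learning rate (1e-5 to 1e-6), groups of G=4 sampled responses per question, and a clipping hyperparameter of 0.2. The Python sketch below illustrates how such settings are commonly combined in GRPO-style training. It is an illustration under assumptions, not the authors' code: the names linear_lr, group_advantages, clipped_surrogate, and SAMPLING are hypothetical, and standardizing rewards within each group is the usual GRPO formulation rather than a detail confirmed by this report.

```python
import math
from statistics import mean, pstdev


def linear_lr(step, total_steps, lr_start=1e-5, lr_end=1e-6):
    """Linearly interpolate the learning rate from lr_start to lr_end."""
    if total_steps <= 1:
        return lr_end
    # Clamp the step so out-of-range values stay at the endpoints.
    frac = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return lr_start + frac * (lr_end - lr_start)


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of G responses (assumed GRPO style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one response (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Decoding settings as quoted in the Experiment Setup row; the key names are
# illustrative and not tied to any particular library.
SAMPLING = {
    "n": 4,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.05,
}

if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical accuracy rewards for G=4
    print(group_advantages(rewards))
    print(linear_lr(0, 100), linear_lr(99, 100))
```

With clip_eps=0.2, a probability ratio above 1.2 earns no extra credit when the advantage is positive, and a ratio below 0.8 earns none when it is negative, which limits how far a single update can move the policy.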