Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Authors: Yizhen Zhang, Yang Ding, Shuoshuo Zhang, Xinchen Zhang, Haoling Li, Zhong-Zhi Li, Peijie Wang, Jie Wu, Lei Ji, Yeyun Gong, Yelong Shen, Yujiu Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks. |
| Researcher Affiliation | Collaboration | Yizhen Zhang (ϕ, π), Yang Ding (ϕ), Shuoshuo Zhang (ϕ, π), Xinchen Zhang (ϕ), Haoling Li (ϕ, π), Zhong-Zhi Li (ρ, π), Peijie Wang (ρ), Jie Wu (ϕ, π), Lei Ji (π), Yeyun Gong (π), Yelong Shen (π), Yujiu Yang (ϕ). Affiliations: ϕ Tsinghua University, π Microsoft, ρ CASIA. |
| Pseudocode | Yes | Algorithm 1 PeRL: Permutation-Enhanced Reinforcement Learning |
| Open Source Code | Yes | https://github.com/alchemistyzz/PeRL |
| Open Datasets | Yes | Our training data comprise two parts: 22K multi-image instruction examples curated from the 721K examples in Mantis-Instruct [19], and 36K single-image examples from the K12 dataset for RL. We conduct experiments on both multi-image benchmarks and single-image benchmarks. As the main experiment, we employ Mantis-Eval [19], BLINK [14], MMIU [35] as multi-image benchmarks. Furthermore, we evaluate the generalization on widely used single-image benchmarks including MathVista [32], MathVerse [60] and MathVision [49]. Besides, we also evaluate our model on out-of-domain multi-image benchmarks including Remi [20] and MV-MATH [50]. |
| Dataset Splits | Yes | Our training data comprise two parts: 22K multi-image instruction examples curated from the 721K examples in Mantis-Instruct [19], and 36K single-image examples from the K12 dataset for RL. For evaluation, we use greedy decoding with temperature set to 0, top-p to 1, top-k to -1, and a maximum generation length of 2048. MathVista, MathVerse, MathVision, and BLINK are configured via VLMEvalKit, while Mantis-Eval, MMIU, and MV-MATH are evaluated with official code via vLLM. All evaluations follow consistent decoding settings. Details are shown in Table 4 and prompt A.3. Table 4: Details of evaluation benchmarks (benchmark: description, #samples). Mantis-Eval: Multi-image General Understanding QA, 217; BLINK: Multi-image General Understanding QA, 1901; MMIU: Multi-image General Understanding QA, 11698; MathVista: Single-image Math Reasoning QA, 1000 (testmini); MathVerse: Single-image Math Reasoning QA, 3940; MathVision: Single-image Math Reasoning QA, 3040; Remi: Multi-image General Reasoning, 2600; MV-MATH: Multi-image Math Reasoning, 2009. |
| Hardware Specification | Yes | We train our model on 8 H100 GPUs using the GRPO-based framework. |
| Software Dependencies | Yes | We initialize our policy with Qwen2.5-VL-7B-Instruct [7] and build on the veRL framework [40]. |
| Experiment Setup | Yes | During RL fine-tuning, we apply one random permutation per sample (ns = 1) and generate six responses per order (n = 6), yielding 12 rollouts per input. We set the KL coefficient β = 0.01, train for 2 epochs with a learning rate of 1 × 10⁻⁶ and a batch size of 256. Further details are provided in the appendix. |
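
The rollout count quoted in the Experiment Setup row can be checked with a small sketch. It assumes the original image order is kept alongside the `ns` random permutations, which is the reading under which one permutation and six responses per order give 12 rollouts. The names `build_orders`, `rollout_count`, and `SETUP` are hypothetical and are not taken from the PeRL code base; the learning rate is read as 1e-6.

```python
import random

# Hyperparameters as quoted in the Experiment Setup row (key names are illustrative).
SETUP = {
    "n_s": 1,               # random permutations per sample
    "n": 6,                 # responses generated per image order
    "kl_coef": 0.01,        # KL coefficient beta
    "epochs": 2,
    "learning_rate": 1e-6,  # assumed reading of the quoted learning rate
    "batch_size": 256,
}


def build_orders(num_images, n_s, rng):
    """Return the original image order plus n_s random permutations of it.

    Assumption: the original order is always included. A random shuffle may
    occasionally reproduce the original order; this sketch does not filter that.
    """
    original = list(range(num_images))
    orders = [original]
    for _ in range(n_s):
        perm = original[:]
        rng.shuffle(perm)
        orders.append(perm)
    return orders


def rollout_count(n_s, n):
    """Rollouts per input: (original order + n_s permuted orders) * n responses."""
    return (1 + n_s) * n


if __name__ == "__main__":
    orders = build_orders(num_images=3, n_s=SETUP["n_s"], rng=random.Random(0))
    print(len(orders), "image orders per input")
    print(rollout_count(SETUP["n_s"], SETUP["n"]), "rollouts per input")
```

Under these assumptions, `rollout_count(1, 6)` reproduces the 12 rollouts per input stated in the table.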