Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

Authors: Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
Researcher Affiliation	Collaboration	(1) The Hong Kong Polytechnic University; (2) InfiX.ai; (3) Zhejiang University
Pseudocode	No	The paper describes its methodology through mathematical derivations and textual explanations of its components and strategies (FuseRLHF, FPO objective, Length Normalization, Probability Clipping, Max-margin Fusion), but it does not present a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Project Page: https://github.com/InfiXAI/InfiFPO
Open Datasets	Yes	We constructed a new training dataset comprising 150k examples across mathematics, coding, and general tasks. Data sources include Infinity-Instruct [14], NuminaMath-1.5 [15], and KodCode-V1-SFT [16], with detailed statistics provided in Table 1.
Dataset Splits	Yes	In the first stage, we performed SFT on half of our dataset with y_w for 3 epochs, using a learning rate of 1e-6 to build the SFT model. This model then served as the foundation for the second stage, where we conducted Preference Optimization on the remaining half of the data for a single epoch, with a learning rate of 1e-7 and β = 2.5.
Hardware Specification	Yes	Our training process involved two stages with a batch size of 128 and a maximum sequence length of 4,096 tokens, using 16 NVIDIA A800-80GB GPUs.
Software Dependencies	No	The paper mentions "vLLM for acceleration" in Appendix E.2.1 and implicitly uses frameworks compatible with "NVIDIA A800-80GB GPUs" (e.g., CUDA, PyTorch), but it does not provide specific version numbers for any software components.
Experiment Setup	Yes	Our training process involved two stages with a batch size of 128 and a maximum sequence length of 4,096 tokens, using 16 NVIDIA A800-80GB GPUs. We implemented a cosine learning rate schedule with a 10% warmup ratio. In the first stage, we performed SFT on half of our dataset with y_w for 3 epochs, using a learning rate of 1e-6 to build the SFT model. This model then served as the foundation for the second stage, where we conducted Preference Optimization on the remaining half of the data for a single epoch, with a learning rate of 1e-7 and β = 2.5.
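
The learning rate schedule quoted in the Experiment Setup row can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the quoted text gives the schedule type (cosine), the warmup ratio (10%), the peak learning rates (1e-6 and 1e-7), the epochs (3 and 1), and the batch size (128), but it does not say whether warmup is linear or whether the rate decays to zero, so both are assumed here. The function names `cosine_lr` and `total_steps` and the figure of 75,000 examples per stage (half of roughly 150k) are hypothetical and used only for illustration.

```python
import math

BATCH_SIZE = 128  # from the quoted setup

# Peak learning rates and epochs per stage, as quoted in the table.
STAGES = {
    "sft": {"peak_lr": 1e-6, "epochs": 3},
    "preference_optimization": {"peak_lr": 1e-7, "epochs": 1},
}


def total_steps(num_examples, epochs, batch_size=BATCH_SIZE):
    """Number of optimizer steps, assuming one step per batch."""
    return math.ceil(num_examples / batch_size) * epochs


def cosine_lr(step, num_steps, peak_lr, warmup_ratio=0.10):
    """Learning rate at a 0-indexed step: warmup, then cosine decay."""
    warmup_steps = int(num_steps * warmup_ratio)
    if step < warmup_steps:
        # ASSUMPTION: linear warmup up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # ASSUMPTION: cosine decay from the peak toward zero.
    decay_steps = max(1, num_steps - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # ASSUMPTION: 75,000 examples per stage (half of roughly 150k).
    for name, cfg in STAGES.items():
        n = total_steps(75_000, cfg["epochs"])
        mid = cosine_lr(n // 2, n, cfg["peak_lr"])
        print(f"{name}: {n} steps, lr at midpoint = {mid:.3e}")
```

Under these assumptions, the rate reaches its peak at the end of the first 10% of steps and then falls smoothly for the remainder of the stage. The exact step counts depend on the true dataset size, which the quoted text gives only approximately.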