Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Probabilistic Token Alignment for Large Language Model Fusion

Authors: Runjia Zeng, James Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results demonstrate that probabilistic token alignment enhances the target model's performance across multiple capabilities. Our code is available at runjia.tech/neurips_pta-llm. 4 Experiments 4.1 Experimental Setup 4.2 Main Results Table 2 presents the overall performance of PTA-LLM compared to three sets of baseline models (i.e., source LLMs, Llama-2 CLM and FUSELLM). The results indicate that the original LLMs exhibit varying performance across the six benchmarks, with Llama-2 generally achieving the best results, while MPT demonstrates the weakest overall performance.
Researcher Affiliation	Collaboration	Runjia Zeng1, James Chenhao Liang2, Cheng Han3, Zhiwen Cao4, Jiahao Liu5, Xiaojun Quan6, Yingjie Victor Chen7, Lifu Huang8, Tong Geng9, 10, Qifan Wang11, Dongfang Liu1 1Rochester Institute of Technology 2U.S. Naval Research Laboratory 3University of Missouri-Kansas City 4Adobe 5Meituan 6Sun Yat-sen University 7Purdue University 8UC Davis 9University of Rochester 10Rice University 11Meta AI Corresponding author
Pseudocode	Yes	Algorithm 1 Sinkhorn Algorithm for Optimal Transport Algorithm 2 Probabilistic Token Alignment
Open Source Code	No	Our code is available at runjia.tech/neurips_pta-llm. Reproducibility PTA-LLM is implemented in PyTorch [26] using the Hugging Face Transformers library [27], accelerated by Flash Attention [28]. Our full implementation will be publicly released. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We claim reproducibility in both 4.1 and S2. All the datasets included in our study are publicly available (MiniPile). Our code will be publicly available after acceptance. The publicly available code should be adequate to replicate the primary experimental results.
Open Datasets	Yes	Training details We fine-tune the Llama-2 7B model using a batch size of 256 and a maximum sequence length of 2,048 tokens with a combination weight (i.e., the λ in 3.1) of 0.8 on MiniPile [25] following [11]. More details are presented in S2. Evaluation We evaluate PTA-LLM on six benchmarks (see details in S3) that span various core capabilities of LLMs, including reasoning, coding, commonsense, safety and multilingual ability. S3 Details of Dataset The Grade School Math [32], proposed by OpenAI... Big-Bench Hard (BBH) [34]... MultiPL-E (ME) [36]... Measuring Massive Multitask Language Understanding (MMLU) [39]... ToxiGen [40]... TyDiQA [41]...
Dataset Splits	Yes	Training details We fine-tune the Llama-2 7B model using a batch size of 256 and a maximum sequence length of 2,048 tokens with a combination weight (i.e., the λ in 3.1) of 0.8 on MiniPile [25] following [11]. MiniPile is a compact yet diverse training dataset consisting of up to 1 million samples, carefully curated from the Pile to preserve the original corpus's richness across various domains while maintaining a manageable size for efficient experimentation. S3 introduces the datasets we applied during our experiments. S7.3 Training Time ...using subsets of MiniPile in Tab. S6, which consists of 1M training samples. Specifically, we randomly sampled the original dataset to create training subsets.
Hardware Specification	Yes	Training Time Training is conducted on 8 NVIDIA A100-80GB GPUs (approximately 26 hours for a single epoch) and 8 NVIDIA H100-80GB GPUs (approximately 17 hours for a single epoch), while conducting evaluation on 4 NVIDIA A100-40GB GPUs (time varies depending on the amount of benchmark data used).
Software Dependencies	No	Reproducibility PTA-LLM is implemented in PyTorch [26] using the Hugging Face Transformers library [27], accelerated by Flash Attention [28]. Our full implementation will be publicly released. For the training acceleration, we leverage DeepSpeed [44] and Flash Attention [28].
Experiment Setup	Yes	Training details We fine-tune the Llama-2 7B model using a batch size of 256 and a maximum sequence length of 2,048 tokens with a combination weight (i.e., the λ in 3.1) of 0.8 on MiniPile [25] following [11]. More details are presented in S2. S8 Per-task Results on Different Benchmarks For the training acceleration, we leverage DeepSpeed [44] and Flash Attention [28]. More specifically, we optimize our model using the AdamW optimizer, with hyperparameters set to β1 = 0.9 and β2 = 0.95, applying gradient clipping at 1.0 and a weight decay of 0.05. The learning rate follows a cosine schedule, peaking at 1 × 10⁻⁵, with a warmup ratio of 0.008.
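
The Experiment Setup row gives a peak learning rate and a warmup ratio but not the exact shape of the schedule. The sketch below shows one common reading of "cosine schedule with warmup" in plain Python. It is not taken from the PTA-LLM code: the linear warmup, the decay to zero, the function name lr_at_step, and the step count of 1000 are all assumptions made for illustration. Only the peak value and the warmup ratio come from the quoted text.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 1e-5
WARMUP_RATIO = 0.008


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed step.

    Assumed shape (not confirmed by the source): linear warmup to the
    peak, then cosine decay from the peak down to zero.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Number of warmup steps implied by the ratio; at least one step.
    warmup_steps = max(1, round(warmup_ratio * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for the demo
    for s in (0, 7, 8, 500, 1000):
        print(s, lr_at_step(s, total))
```

With the hypothetical 1000 steps, a ratio of 0.008 yields 8 warmup steps. The real step count depends on how the MiniPile samples are batched and packed, which the quoted excerpts do not specify.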