Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Sparta Alignment: Collectively Aligning Multiple Language Models through Combat

Authors: Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
Researcher Affiliation | Academia | 1 Zhejiang University; 2 New York University; 3 University of Washington
Pseudocode | Yes | We present an overview of SPARTA ALIGNMENT in Figure 1 and Algorithm 1, followed by detailed explanation of the three key components in the algorithm: Match-Making System, Judgment Aggregation, and Reputation System.
Open Source Code | Yes | Resources available at https://github.com/yurujiang2003/sparta.
Open Datasets | Yes | We evaluate SPARTA ALIGNMENT across 8 tasks and 12 datasets spanning three evaluation domains: (1) Domain-Specific Question Answering, including MedQA-US (MedQA) [46] and Normad [73]; (2) Reasoning, covering GSM8K [12], Knowledge Crosswords (KCross) [16], COM2 [22], and MATH [36]; (3) Instruction Following and Safety, evaluated on Alpaca [18] for instruction-following and TruthfulQA (Truthful) [59]
Dataset Splits | Yes | The statistics for the datasets used in our main experiments are summarized in Table 4. The table presents the number of training, validation, and test examples for each of the 12 distinct tasks. Each dataset is split into three parts: Train, Validation, and Test, where the validation set is constructed by splitting the original test set evenly, ensuring balanced evaluation during development and final testing.
Hardware Specification | Yes | Experiments are performed on a cluster with 16 A100 GPUs with 40 GB memory.
Software Dependencies | No | The paper mentions methods like DPO [72] and LoRA [38] but does not specify software names with version numbers for their implementation (e.g., Python, PyTorch versions or specific library versions).
Experiment Setup | Yes | For SPARTA ALIGNMENT, we set the number of prompts per iteration to 1000, number of iterations T = 8, α = 0.6, a top-k threshold of k = 5, and κ = 1. At the end of each iteration, all models are fine-tuned via DPO [72] for 1 epoch with a starting learning rate of 1e-6 and an effective batch size of 1, with the same LoRA configuration in SFT phase in Appendix C.3. Appendix C.3 states: Fine-tuning is performed with LoRA [38], employing a learning rate of 2e-4, cosine learning rate scheduling, an effective batch size of 32, a warm-up ratio of 0.1, and 5 default training epochs.
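
For readers who want to reuse the reported settings, the hyperparameters quoted in the Experiment Setup entry can be collected into a small configuration object. The sketch below is illustrative only: the class and field names (`LoRASFTConfig`, `SpartaDPOConfig`, `total_prompts`, and so on) are hypothetical and are not taken from the authors' repository, and the final symbol in the quoted list is assumed to be κ, named `kappa` here. The quoted text does not say what α, k, or κ control, so the code records their values without interpreting them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRASFTConfig:
    """SFT-phase fine-tuning settings as quoted from Appendix C.3.

    Field names are hypothetical; only the values come from the quoted text.
    """
    learning_rate: float = 2e-4
    lr_scheduler: str = "cosine"
    effective_batch_size: int = 32
    warmup_ratio: float = 0.1
    num_epochs: int = 5


@dataclass(frozen=True)
class SpartaDPOConfig:
    """Per-iteration settings quoted in the Experiment Setup entry.

    Field names are hypothetical; the roles of alpha, top_k and kappa are
    not described in the quoted text, so they are stored as plain values.
    """
    prompts_per_iteration: int = 1000
    num_iterations: int = 8          # T in the quoted text
    alpha: float = 0.6
    top_k: int = 5
    kappa: float = 1.0               # assumed to be the symbol kappa
    dpo_epochs: int = 1
    dpo_learning_rate: float = 1e-6  # described as the starting learning rate
    dpo_effective_batch_size: int = 1

    def total_prompts(self) -> int:
        """Prompts consumed over all iterations (prompts per iteration x T)."""
        return self.prompts_per_iteration * self.num_iterations


if __name__ == "__main__":
    dpo = SpartaDPOConfig()
    sft = LoRASFTConfig()
    print(f"Total prompts across {dpo.num_iterations} iterations: {dpo.total_prompts()}")
    print(f"SFT learning rate: {sft.learning_rate}, DPO learning rate: {dpo.dpo_learning_rate}")
```

Because the Software Dependencies entry notes that no library versions are given, a configuration like this would still need to be mapped onto whichever training framework is used, and the LoRA rank and related settings are not stated in the quoted text.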