Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Aligning Compound AI Systems via System-level DPO

Authors: Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system. [...] These results deepen our understanding of alignment challenges in compound AI systems and provide a foundation for future research.
Researcher Affiliation	Academia	Xiangwen Wang1,2, Yibo Jacky Zhang1, Zhoujie Ding1, Katherine Tsai1, Haolun Wu1,3, Sanmi Koyejo1. 1Stanford University, 2University of Science and Technology of China, 3Mila Quebec AI Institute.
Pseudocode	No	The paper describes the methods and procedures using prose and mathematical equations but does not present any structured pseudocode or algorithm blocks.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: code will be published later.
Open Datasets	Yes	We employ the preference dataset Intel/orca-dpo-pairs [32] for DPO training, consisting of 129000 instructions paired with corresponding preference examples (each instruction has a pair of chosen/rejected responses).
Dataset Splits	Yes	We sample 193 instruction data points as the evaluation set. The remaining examples are used for the training process.
Hardware Specification	Yes	Aligning the entire system took approximately 30 hours on a single NVIDIA H200 GPU.
Software Dependencies	No	The paper mentions specific models like Llama-3-8B, Stable Diffusion XL, Stable Diffusion 1.5, and Qwen1.5-1.8B-Chat, and training techniques like LoRA. However, it does not explicitly state versions for general software dependencies such as programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., scikit-learn, CUDA).
Experiment Setup	Yes	During training, we sampled two intermediate outputs per step during training. ... During both training and evaluation, we set the maximum token number at 256 in the sampling process. We trained the models using LoRA with β = 0.5, a learning rate of 1e-7, and an accumulated batch size of 128.
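
For readers attempting a re-run, the values quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration with a hold-out split, as in the minimal sketch below. It is not the authors' code, which the paper says will be published later. The class name, field names, `split_indices` helper, random seed, and uniform sampling are all assumptions made for illustration; the paper states only the sizes and hyperparameter values.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; the field names are illustrative."""
    beta: float = 0.5                  # DPO beta
    learning_rate: float = 1e-7
    accumulated_batch_size: int = 128
    max_tokens: int = 256              # sampling cap in training and evaluation
    eval_size: int = 193               # held-out instruction data points


def split_indices(n_examples, eval_size, seed=0):
    """Hold out `eval_size` random indices; the rest form the training set.

    The seed and uniform sampling are assumptions; the paper does not
    describe how the evaluation points were drawn.
    """
    if not 0 <= eval_size <= n_examples:
        raise ValueError("eval_size must be between 0 and n_examples")
    rng = random.Random(seed)
    eval_idx = sorted(rng.sample(range(n_examples), eval_size))
    held_out = set(eval_idx)
    train_idx = [i for i in range(n_examples) if i not in held_out]
    return train_idx, eval_idx


setup = ReportedSetup()
# 1,000 is a stand-in dataset size chosen to keep the example small.
train_idx, eval_idx = split_indices(1_000, setup.eval_size)
print(len(train_idx), len(eval_idx))  # prints "807 193"
```

Fixing the seed keeps the evaluation set stable across runs, which matters here because the paper does not publish the indices of its 193 evaluation points.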