Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Aligning Compound AI Systems via System-level DPO

Authors: Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system. [...] These results deepen our understanding of alignment challenges in compound AI systems and provide a foundation for future research.
Researcher Affiliation | Academia | Xiangwen Wang (1,2), Yibo Jacky Zhang (1), Zhoujie Ding (1), Katherine Tsai (1), Haolun Wu (1,3), Sanmi Koyejo (1). Affiliations: (1) Stanford University; (2) University of Science and Technology of China; (3) Mila Quebec AI Institute.
Pseudocode | No | The paper describes the methods and procedures using prose and mathematical equations but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: code will be published later.
Open Datasets | Yes | We employ the preference dataset Intel/orca-dpo-pairs [32] for DPO training, consisting of 129000 instructions paired with corresponding preference examples (each instruction has a pair of chosen/rejected responses).
Dataset Splits | Yes | We sample 193 instruction data points as the evaluation set. The remaining examples are used for the training process.
Hardware Specification | Yes | Aligning the entire system took approximately 30 hours on a single NVIDIA H200 GPU.
Software Dependencies | No | The paper mentions specific models like Llama-3-8B, Stable Diffusion XL, Stable Diffusion 1.5, and Qwen1.5-1.8B-Chat, and training techniques like LoRA. However, it does not explicitly state versions for general software dependencies such as programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., scikit-learn, CUDA).
Experiment Setup | Yes | During training, we sampled two intermediate outputs per step during training. ... During both training and evaluation, we set the maximum token number at 256 in the sampling process. We trained the models using LoRA with β = 0.5, a learning rate of 1e-7, and an accumulated batch size of 128.
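
Because the code has not been released, the following is only a minimal Python sketch of how the reported Dataset Splits and Experiment Setup values might be wired together. It is not the authors' implementation. The constants for the evaluation set size, β, learning rate, accumulated batch size, and token limit come from the rows above; the random seed, the micro-batch size, and all function names are hypothetical. The loss shown is the standard per-pair DPO loss, used here as a stand-in, not the paper's system-level variant.

```python
import math
import random

# Values reported in the summary above.
EVAL_SIZE = 193                # evaluation examples held out
BETA = 0.5                     # DPO temperature
LEARNING_RATE = 1e-7
ACCUMULATED_BATCH_SIZE = 128
MAX_NEW_TOKENS = 256           # sampling limit for training and evaluation

# Assumptions: neither value is stated in the source.
SEED = 0
MICRO_BATCH_SIZE = 4


def split_indices(num_examples, eval_size=EVAL_SIZE, seed=SEED):
    """Hold out `eval_size` random indices; the rest form the training set."""
    if eval_size > num_examples:
        raise ValueError("eval_size exceeds the number of examples")
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    return indices[eval_size:], indices[:eval_size]


def accumulation_steps(accumulated=ACCUMULATED_BATCH_SIZE, micro=MICRO_BATCH_SIZE):
    """Number of micro-batches per optimizer step for the target batch size."""
    if accumulated % micro != 0:
        raise ValueError("accumulated batch size must be a multiple of micro")
    return accumulated // micro


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    train_idx, eval_idx = split_indices(129000)
    print(len(train_idx), len(eval_idx), accumulation_steps())
```

With the assumed micro-batch size of 4, reaching the reported accumulated batch size of 128 takes 32 gradient-accumulation steps per optimizer update; a different micro-batch size changes only that count, not the effective batch size.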