Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Aligning Compound AI Systems via System-level DPO
Authors: Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system. [...] These results deepen our understanding of alignment challenges in compound AI systems and provide a foundation for future research. |
| Researcher Affiliation | Academia | Xiangwen Wang (1,2), Yibo Jacky Zhang (1), Zhoujie Ding (1), Katherine Tsai (1), Haolun Wu (1,3), Sanmi Koyejo (1). Affiliations: (1) Stanford University; (2) University of Science and Technology of China; (3) Mila Quebec AI Institute. |
| Pseudocode | No | The paper describes the methods and procedures using prose and mathematical equations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: code will be published later. |
| Open Datasets | Yes | We employ the preference dataset Intel/orca-dpo-pairs [32] for DPO training, consisting of 129000 instructions paired with corresponding preference examples (each instruction has a pair of chosen/rejected responses). |
| Dataset Splits | Yes | We sample 193 instruction data points as the evaluation set. The remaining examples are used for the training process. |
| Hardware Specification | Yes | Aligning the entire system took approximately 30 hours on a single NVIDIA H200 GPU. |
| Software Dependencies | No | The paper mentions specific models like Llama-3-8B, Stable Diffusion XL, Stable Diffusion 1.5, and Qwen1.5-1.8B-Chat, and training techniques like LoRA. However, it does not explicitly state versions for general software dependencies such as programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., scikit-learn, CUDA). |
| Experiment Setup | Yes | During training, we sampled two intermediate outputs per step during training. ... During both training and evaluation, we set the maximum token number at 256 in the sampling process. We trained the models using LoRA with β = 0.5, a learning rate of 1e-7, and an accumulated batch size of 128. |
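
The reported setup values can be gathered into a small configuration sketch, which may help when attempting a reproduction. This is a minimal illustration and not the authors' code, since the table notes that the code is not yet released. The names `TrainConfig`, `split_indices`, and `accumulation_steps` are hypothetical, as are the random seed, the use of uniform random sampling for the 193-example evaluation set, and the micro-batch size; the paper excerpts above state none of these.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the table above; field names are hypothetical."""
    beta: float = 0.5                  # DPO temperature beta
    learning_rate: float = 1e-7
    accumulated_batch_size: int = 128  # effective batch size after accumulation
    max_new_tokens: int = 256          # sampling cap in training and evaluation
    samples_per_step: int = 2          # intermediate outputs sampled per step
    eval_size: int = 193               # held-out evaluation instructions


def split_indices(num_examples: int, eval_size: int, seed: int = 0):
    """Hold out `eval_size` examples and train on the rest.

    Assumption: uniform random sampling with a fixed seed. The source only
    says that 193 data points are sampled for evaluation.
    """
    if not 0 <= eval_size <= num_examples:
        raise ValueError("eval_size must be between 0 and num_examples")
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    eval_idx = sorted(indices[:eval_size])
    train_idx = sorted(indices[eval_size:])
    return train_idx, eval_idx


def accumulation_steps(accumulated_batch_size: int, micro_batch_size: int) -> int:
    """Gradient-accumulation steps needed to reach the effective batch size."""
    if micro_batch_size <= 0 or accumulated_batch_size % micro_batch_size != 0:
        raise ValueError("micro_batch_size must evenly divide the accumulated batch size")
    return accumulated_batch_size // micro_batch_size


if __name__ == "__main__":
    cfg = TrainConfig()
    # 1000 is a placeholder dataset size used only for this demonstration.
    train_idx, eval_idx = split_indices(1000, cfg.eval_size)
    print(len(train_idx), len(eval_idx))
    # A micro-batch size of 8 is an assumption; the paper reports only the total of 128.
    print(accumulation_steps(cfg.accumulated_batch_size, 8))
```

In practice, the index lists would be applied to the Intel/orca-dpo-pairs dataset after loading it, and the micro-batch size would be chosen to fit the memory of the single GPU mentioned above.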