Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W Lee, Richard Ren, Long Phan, Norman Mu, Oliver Zhang, Dan Hendrycks

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct all experiments on a curated set of 500 textual outcomes, each representing an observation about a potential state of the world. Examples are shown in Appendix A.4. Using the forced-choice procedure from Appendix A.2, we obtain pairwise preferences for 18 open-source and 5 proprietary LLMs spanning a broad range of model scales.
Researcher Affiliation	Collaboration	Mantas Mazeika (1), Xuwang Yin (1), Rishub Tamirisa (1), Jaehyuk Lim (2), Bruce W. Lee (2), Richard Ren (2), Long Phan (1), Norman Mu (3), Oliver Zhang (1), Dan Hendrycks (1). Affiliations: (1) Center for AI Safety; (2) University of Pennsylvania; (3) University of California, Berkeley
Pseudocode	Yes	Algorithm 1 Iterative Active Learning for Pairwise Comparisons
Open Source Code	Yes	Code and data for replicating experiments are available at https://github.com/centerforaisafety/emergent-values.
Open Datasets	Yes	Citizen profiles are sampled from the 2023 American Community Survey (ACS) 1-Year Estimates Public Use Microdata Sample provided by the U.S. Census Bureau [U.S. Census Bureau, 2023] dataset API, through which we obtain the following demographic information: age, gender, ethnicity, occupation, annual household income, marital status, and state of residence.
Dataset Splits	Yes	We build a preference dataset Dprefs from M = 373 possible outcomes, subsampling the complete preference graph to obtain N = 12,746 preference-elicitation questions (an 80-20 train-test split).
Hardware Specification	Yes	All experiments were conducted on A100 GPUs.
Software Dependencies	No	We fine-tune Llama-3.1-8B-Instruct [AI@Meta, 2024] for 2 epochs on 10,196 training questions with learning rate 2e-5 using AdamW [Loshchilov and Hutter, 2019]. No specific version numbers for key software components like Python, PyTorch, or CUDA are provided in the main text.
Experiment Setup	Yes	We fine-tune Llama-3.1-8B-Instruct [AI@Meta, 2024] for 2 epochs on 10,196 training questions with learning rate 2e-5 using AdamW [Loshchilov and Hutter, 2019].
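
The Dataset Splits and Experiment Setup rows quote two figures that can be checked against each other: N = 12,746 questions with an 80-20 train-test split, and 10,196 training questions. The sketch below shows one way such a split could be produced and confirms that the counts agree. It is an illustration only: the function name `split_questions`, the seed, the use of a shuffled index list, and rounding the training count down are all assumptions, not details taken from the paper or its repository.

```python
import random


def split_questions(n_questions, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle question indices and split them.

    The training count is rounded down (an assumption); the remaining
    indices form the test set.
    """
    indices = list(range(n_questions))
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(indices)
    n_train = int(n_questions * train_fraction)
    return indices[:n_train], indices[n_train:]


# N = 12,746 is the number of preference-elicitation questions quoted above.
train_idx, test_idx = split_questions(12_746)
print(len(train_idx), len(test_idx))
```

Under these assumptions the training set has 10,196 questions, matching the number quoted in the Experiment Setup row, and the remaining 2,550 questions form the test set.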