Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Oliver Zhang, Dan Hendrycks

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct all experiments on a curated set of 500 textual outcomes, each representing an observation about a potential state of the world. Examples are shown in Appendix A.4. Using the forced-choice procedure from Appendix A.2, we obtain pairwise preferences for 18 open-source and 5 proprietary LLMs spanning a broad range of model scales.
Researcher Affiliation | Collaboration | Mantas Mazeika¹, Xuwang Yin¹, Rishub Tamirisa¹, Jaehyuk Lim², Bruce W. Lee², Richard Ren², Long Phan¹, Norman Mu³, Oliver Zhang¹, Dan Hendrycks¹; ¹Center for AI Safety, ²University of Pennsylvania, ³University of California, Berkeley
Pseudocode | Yes | Algorithm 1 Iterative Active Learning for Pairwise Comparisons
Open Source Code | Yes | Code and data for replicating experiments are available at https://github.com/centerforaisafety/emergent-values.
Open Datasets | Yes | Citizen profiles are sampled from the 2023 American Community Survey (ACS) 1-Year Estimates Public Use Microdata Sample provided by the U.S. Census Bureau [U.S. Census Bureau, 2023] dataset API, through which we obtain the following demographic information: age, gender, ethnicity, occupation, annual household income, marital status, and state of residence.
Dataset Splits | Yes | We build a preference dataset D_prefs from M = 373 possible outcomes, subsampling the complete preference graph to obtain N = 12,746 preference-elicitation questions (an 80-20 train-test split).
Hardware Specification | Yes | All experiments were conducted on A100 GPUs.
Software Dependencies | No | We fine-tune Llama-3.1-8B-Instruct [AI@Meta, 2024] for 2 epochs on 10,196 training questions with learning rate 2e-5 using AdamW [Loshchilov and Hutter, 2019]. No specific version numbers for key software components like Python, PyTorch, or CUDA are provided in the main text.
Experiment Setup | Yes | We fine-tune Llama-3.1-8B-Instruct [AI@Meta, 2024] for 2 epochs on 10,196 training questions with learning rate 2e-5 using AdamW [Loshchilov and Hutter, 2019].
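
The figures quoted in the Dataset Splits and Experiment Setup rows can be checked against each other with a little arithmetic. The sketch below is only a consistency check, not the authors' code: the function name `split_questions`, the seed, and the shuffle-then-cut strategy are assumptions made for illustration, as is the reading of "complete preference graph" as all unordered pairs of the M = 373 outcomes.

```python
import math
import random


def split_questions(n_questions, train_fraction=0.8, seed=0):
    """Shuffle question indices and cut them into train and test lists.

    Hypothetical helper: the report does not say how the split was drawn.
    """
    indices = list(range(n_questions))
    random.Random(seed).shuffle(indices)
    # Floor the training size so the two parts always sum to n_questions.
    n_train = math.floor(n_questions * train_fraction)
    return indices[:n_train], indices[n_train:]


M = 373        # possible outcomes (from the report)
N = 12_746     # preference-elicitation questions (from the report)

# Assumption: the complete preference graph has one edge per unordered pair.
total_pairs = math.comb(M, 2)
sampled_share = N / total_pairs

train_idx, test_idx = split_questions(N)

print("unordered pairs:", total_pairs)
print("share of pairs sampled:", round(sampled_share, 3))
print("train questions:", len(train_idx))
print("test questions:", len(test_idx))
```

Under these assumptions, flooring 80% of 12,746 gives 10,196 training questions, which matches the count quoted in the fine-tuning description, and leaves 2,550 questions for testing.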