Learning Human-like Representations to Enable Learning Human Values

Authors: Andrea Wynn, Ilia Sucholutsky, Tom Griffiths

NeurIPS 2024

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
LLM Response: We begin with a theoretical prediction, show that it applies to learning human morality judgments, and then show that our results generalize to ten different aspects of human values, including ethics, honesty, and fairness, by training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
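To make the notion of representational alignment concrete, the following is a minimal Python sketch (not the authors' code) of one common way to quantify alignment between human and language-model similarity judgments: correlate the off-diagonal entries of the two 50x50 similarity kernels. The Spearman metric and the variable names are illustrative assumptions, not necessarily the paper's exact measure.

import numpy as np
from scipy.stats import spearmanr

def representational_alignment(human_kernel, model_kernel):
    # Compare pairwise similarities, ignoring the trivial diagonal entries.
    iu = np.triu_indices_from(human_kernel, k=1)
    return spearmanr(human_kernel[iu], model_kernel[iu]).correlation

# Random symmetric 50x50 kernels stand in for the real similarity matrices.
rng = np.random.default_rng(0)
human = rng.random((50, 50)); human = (human + human.T) / 2
model = rng.random((50, 50)); model = (model + model.T) / 2
print(representational_alignment(human, model))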
Researcher Affiliation: Academia
LLM Response: Andrea H. Wynn, Department of Computer Science, Princeton University, Princeton, NJ 08542 (awynn13@jhu.edu); Ilia Sucholutsky, Department of Computer Science, Princeton University, Princeton, NJ 08542 (is3060@nyu.edu); Thomas L. Griffiths, Department of Psychology and Department of Computer Science, Princeton University, Princeton, NJ 08542 (tomg@princeton.edu). Andrea Wynn is currently affiliated with the Department of Computer Science at Johns Hopkins University, Baltimore, MD. Ilia Sucholutsky is currently affiliated with the New York University Center for Data Science, New York, NY.
Pseudocode: Yes
LLM Response: Algorithm 1, "Kernel-based Agent Experiment for Learning Human Value Judgments".
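As a reading aid for the entry above, here is a hypothetical Python sketch of the core idea behind a kernel-based agent: predict an action's reward as a similarity-weighted average of the rewards observed so far. The Nadaraya-Watson style estimator below is an assumption, not necessarily the exact update in Algorithm 1.

import numpy as np

def kernel_predict(kernel, observed):
    # kernel[i, j]: similarity between actions i and j (human- or model-derived).
    # observed: dict mapping an action index to the list of rewards sampled for it.
    n = kernel.shape[0]
    preds = np.zeros(n)
    for x in range(n):
        num = sum(kernel[x, a] * np.mean(r) for a, r in observed.items())
        den = sum(kernel[x, a] for a in observed)
        preds[x] = num / den if den > 0 else 0.0
    return preds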
Open Source Code: Yes
LLM Response: The code files are attached with the submission. Aggregated human value judgments are provided in the data/justice_50_actions_with_values.csv file. The aggregated human similarity judgments are included in models/embedding_kernels/survey_similarity.npy. Language model kernels were computed using compute_embedding_model_kernels.ipynb. Pre-computed language model kernels are available in models/embedding_kernels/ as .npy files.
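The file paths below come from the submission description; this is a minimal sketch of loading those artifacts, assuming numpy/pandas and default CSV parsing (the CSV's column names are not verified here).

import numpy as np
import pandas as pd

values = pd.read_csv("data/justice_50_actions_with_values.csv")           # aggregated human value judgments
human_kernel = np.load("models/embedding_kernels/survey_similarity.npy")  # aggregated human similarity judgments
print(values.shape, human_kernel.shape)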
Open Datasets: Yes
LLM Response: We first create a set of 50 textual descriptions of morally relevant actions (adapted from the Justice category of actions in the ETHICS dataset [18]).
Dataset Splits: Yes
LLM Response: The agent is only allowed to take 25 of the 50 actions (the personalization set). ... In the generalization phase, we repeat Algorithm 1, with two differences. First, instead of the 25 actions seen during the personalization phase, the agent can only choose from the 25 other actions that it has not yet seen.
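A short sketch of the 25/25 split described above; drawing the personalization set uniformly at random is an assumption about how the split was made.

import numpy as np

rng = np.random.default_rng(0)
all_actions = np.arange(50)
personalization_set = rng.choice(all_actions, size=25, replace=False)  # actions available during personalization
generalization_set = np.setdiff1d(all_actions, personalization_set)    # held-out actions for the generalization phase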
Hardware Specification: No
LLM Response: All experiments were run on CPU on a university compute cluster. Not all experiments run were reported in this paper; some preliminary experiments were also run, e.g., we experimented with using binary vs. continuous rewards for the kernel methods.
Software Dependencies: No
LLM Response: The paper mentions various language models and embedding models used (e.g., the Hugging Face sentence-transformers model zoo, Google's USE model, Doc2Vec, OpenAI's text-embedding-ada-002 model), but it does not specify software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
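For illustration, here is a sketch of how a language-model similarity kernel over the action descriptions could be computed with the sentence-transformers library named above; the specific checkpoint, the placeholder action texts, and the cosine-similarity kernel are assumptions, not the paper's documented configuration.

import numpy as np
from sentence_transformers import SentenceTransformer

actions = ["I returned a lost wallet to its owner.",
           "I took credit for a colleague's work."]      # placeholder action descriptions
model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed checkpoint
emb = model.encode(actions, normalize_embeddings=True)   # unit-norm sentence embeddings
kernel = emb @ emb.T                                     # cosine-similarity kernel over actions
print(kernel)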
Experiment Setup: Yes
LLM Response: The reward distribution for each action (arm of the bandit) is parameterized by a morality score m_i ∈ [−3, 3] for each action i. We used Thompson sampling, a popular Bayesian approach for solving multi-armed bandit problems [3], as a baseline method for comparison with the kernel-based agents. Personalization phase: the agent takes actions in its environment and learns from these actions. The agent is only allowed to take 25 of the 50 actions (the personalization set). We limit the agent to 1000 time-steps to reflect real-world constraints on human-in-the-loop learning. Algorithm 1 details: randomly select a ⊆ A s.t. |a| = 10; choose a new action x via Thompson sampling over the agent's predicted rewards...; sample the true reward r from a Normal distribution N(m_x, 1).
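Putting the pieces of this entry together, the following is a minimal, self-contained Python sketch of the Thompson-sampling baseline on the personalization set: 25 arms, rewards drawn from N(m_x, 1), 1000 time-steps. The Gaussian N(0, 1) prior and the conjugate posterior update are assumptions made to keep the sketch runnable, not details confirmed by the paper.

import numpy as np

rng = np.random.default_rng(0)
m = rng.uniform(-3, 3, size=25)           # stand-in morality scores for the 25 personalization actions
post_mean = np.zeros(25)                  # posterior mean of each arm's expected reward
post_count = np.zeros(25)                 # number of pulls of each arm

for t in range(1000):
    post_var = 1.0 / (1.0 + post_count)             # posterior variance under an N(0, 1) prior and unit reward noise
    sampled = rng.normal(post_mean, np.sqrt(post_var))
    x = int(np.argmax(sampled))                     # Thompson sampling: act greedily on the posterior draw
    r = rng.normal(m[x], 1.0)                       # true reward r ~ N(m_x, 1)
    # Conjugate Gaussian update of the chosen arm's posterior mean.
    post_mean[x] = (post_mean[x] * (1.0 + post_count[x]) + r) / (2.0 + post_count[x])
    post_count[x] += 1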