Reinforcement Learning Under Moral Uncertainty

Authors: Adrien Ecoffet, Joel Lehman

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results illustrate (1) how such uncertainty can help curb extreme behavior from commitment to single theories and (2) several technical complications arising from attempting to ground moral philosophy in RL (e.g. how can a principled trade-off between two competing but incomparable reward functions be reached). We now illustrate various properties of the voting systems for moral uncertainty introduced in this work, and in particular focus on the trade-offs that exist between them. The code for all the experiments presented in this section can be found at https://github.com/uber-research/normative-uncertainty.
Researcher Affiliation | Industry | ¹Uber AI Labs, San Francisco, CA, USA; ²OpenAI, San Francisco, CA, USA (work done at Uber AI Labs).
Pseudocode | Yes | In our implementation, ϵ is annealed to 0 by the end of training (SI E.1). We call this algorithm Variance-SARSA (pseudocode is provided in the SI).
Open Source Code | Yes | The code for all the experiments presented in this section can be found at https://github.com/uber-research/normative-uncertainty.
Open Datasets | No | Our experiments are based on four related gridworld environments (Fig. 1) that tease out differences between various voting systems. These environments are derived from the trolley problem (Foot, 1967), commonly used within moral philosophy to highlight moral intuitions and conflicts between ethical theories.
Dataset Splits | No | All the experiments in this work use short, episodic environments, allowing us to set γᵢ = 1 (i.e. undiscounted rewards) across all of them for simplicity.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided for the experimental setup.
Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper.
Experiment Setup | Yes | In our implementation, ϵ is annealed to 0 by the end of training (SI E.1). ... where ε is a small constant (10⁻⁶ in our experiments) to handle theories with σᵢ² = 0.
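
The Pseudocode, Research Type, and Experiment Setup rows above describe Variance-SARSA only through short excerpts: per-theory preferences are combined by a voting system, and a small constant ε = 10⁻⁶ guards against theories whose Q-values have zero variance. The full algorithm is given in the paper's SI and in the linked repository; as a rough illustration of the kind of credence-weighted, variance-normalized vote those excerpts suggest, here is a minimal sketch. The function name select_action, the per-state variance computation, and the exact normalization are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def select_action(q_values, credences, eps=1e-6):
    """Hypothetical credence-weighted, variance-normalized vote.

    q_values  : array of shape (n_theories, n_actions) holding Q_i(s, a)
                for the current state under each ethical theory i.
    credences : array of shape (n_theories,), the agent's credence in each
                theory (assumed to sum to 1).
    eps       : small constant guarding against theories whose Q-values have
                zero variance (the paper reports using 10⁻⁶ for this purpose).
    """
    q_values = np.asarray(q_values, dtype=float)
    credences = np.asarray(credences, dtype=float)

    # Variance of each theory's Q-values across actions; a theory that is
    # indifferent between all available actions contributes (almost) nothing.
    variances = q_values.var(axis=1)

    # Rescale each theory's preferences so that no theory dominates the vote
    # merely because its reward function uses a larger numerical scale.
    normalized = q_values / np.sqrt(variances + eps)[:, None]

    # Credence-weighted sum of the normalized preferences; the action with
    # the highest total support wins the vote.
    scores = credences @ normalized
    return int(np.argmax(scores))
```

Rescaling by each theory's own spread is one way to make otherwise incomparable reward functions commensurable, which is the kind of principled trade-off the Research Type row alludes to; the authors' actual normalization, and the set of values over which the variance is computed, may differ from this sketch.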
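
The Dataset Splits and Experiment Setup rows also mention two training details: returns are undiscounted (γᵢ = 1) because the environments are short and episodic, and the exploration rate ϵ is annealed to 0 by the end of training. The sketch below shows where those two details would sit in an otherwise standard tabular SARSA loop for a single reward function; the gym-style environment interface (env.reset, env.step, env.action_space.n), hashable states, and the linear annealing schedule are assumptions of this sketch.

```python
import numpy as np
from collections import defaultdict


def epsilon_greedy(action_values, epsilon, n_actions):
    """With probability ϵ take a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(action_values))


def train_sarsa(env, n_episodes=10_000, alpha=0.1, gamma=1.0):
    """Tabular SARSA with ϵ annealed linearly to 0 and undiscounted returns."""
    n_actions = env.action_space.n
    q = defaultdict(lambda: np.zeros(n_actions))

    for episode in range(n_episodes):
        # Anneal ϵ so that it reaches 0 on the final episode, matching the
        # paper's statement that ϵ is annealed to 0 by the end of training.
        epsilon = 1.0 - episode / max(n_episodes - 1, 1)

        state = env.reset()
        action = epsilon_greedy(q[state], epsilon, n_actions)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(q[next_state], epsilon, n_actions)
            # γ = 1 (undiscounted) is safe here because the environments are
            # short and episodic, as noted in the Dataset Splits row.
            target = reward if done else reward + gamma * q[next_state][next_action]
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action
    return q
```

In a full Variance-SARSA implementation the epsilon_greedy calls would instead choose between a random action and the credence-weighted vote over per-theory Q-values, and an update of this form would presumably be applied to each theory's Q-table; the authors' pseudocode in the SI is the authoritative reference.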