Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinforcement Learning Under Moral Uncertainty
Authors: Adrien Ecoffet, Joel Lehman
ICML 2021
| Reproducibility Variable | Result | LLM-Extracted Evidence |
|---|---|---|
| Research Type | Experimental | The results illustrate (1) how such uncertainty can help curb extreme behavior from commitment to single theories and (2) several technical complications arising from attempting to ground moral philosophy in RL (e.g. how can a principled trade-off between two competing but incomparable reward functions be reached). We now illustrate various properties of the voting systems for moral uncertainty introduced in this work, and in particular focus on the trade-offs that exist between them. The code for all the experiments presented in this section can be found at https://github.com/uber-research/normative-uncertainty. |
| Researcher Affiliation | Industry | ¹Uber AI Labs, San Francisco, CA, USA; ²OpenAI, San Francisco, CA, USA (work done at Uber AI Labs). |
| Pseudocode | Yes | In our implementation, ϵ is annealed to 0 by the end of training (SI E.1). We call this algorithm Variance-SARSA (pseudocode is provided in the SI). *A hedged sketch of this selection rule appears after the table.* |
| Open Source Code | Yes | The code for all the experiments presented in this section can be found at https://github.com/uber-research/normative-uncertainty. |
| Open Datasets | No | Our experiments are based on four related gridworld environments (Fig. 1) that tease out differences between various voting systems. These environments are derived from the trolley problem (Foot, 1967), commonly used within moral philosophy to highlight moral intuitions and conflicts between ethical theories. |
| Dataset Splits | No | All the experiments in this work use short, episodic environments, allowing us to set γᵢ = 1 (i.e. undiscounted rewards) across all of them for simplicity. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided for the experimental setup. |
| Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper. |
| Experiment Setup | Yes | In our implementation, ϵ is annealed to 0 by the end of training (SI E.1). ... where ε is a small constant (10⁻⁶ in our experiments) to handle theories with σᵢ² = 0. |
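The two epsilons quoted above outline the shape of the algorithm: an exploration ϵ annealed to 0 over training, and a small constant ε = 10⁻⁶ that keeps variance normalization well-defined for theories whose Q-values have zero variance. The Python sketch below is not the authors' implementation (that lives in the linked repository); it shows one plausible reading of variance-normalized voting over per-theory Q-values. The function names, array shapes, and the mean-centering step are assumptions; only the 10⁻⁶ constant and the annealed exploration rate come from the quoted text.

```python
import numpy as np

EPS_VAR = 1e-6  # the paper's small constant for theories with sigma_i^2 = 0


def variance_normalized_vote(q_values: np.ndarray, credences: np.ndarray) -> np.ndarray:
    """Credence-weighted sum of each theory's variance-normalized preferences.

    q_values:  (num_theories, num_actions) per-theory action values.
    credences: (num_theories,) the agent's credence in each moral theory.
    Returns an aggregated preference per action, shape (num_actions,).
    """
    # Mean-centering is an assumption here; the normalization by
    # sqrt(variance + EPS_VAR) is the step the quoted ε protects.
    mean = q_values.mean(axis=1, keepdims=True)
    var = q_values.var(axis=1, keepdims=True)
    normalized = (q_values - mean) / np.sqrt(var + EPS_VAR)
    return credences @ normalized


def select_action(q_values, credences, explore_eps, rng):
    """Epsilon-greedy over the aggregated vote; the paper anneals this
    exploration epsilon to 0 by the end of training (SI E.1)."""
    if rng.random() < explore_eps:
        return int(rng.integers(q_values.shape[1]))
    return int(np.argmax(variance_normalized_vote(q_values, credences)))


# Usage: the second theory is indifferent (zero variance across actions),
# exactly the sigma_i^2 = 0 case the 1e-6 constant guards against.
rng = np.random.default_rng(0)
q = np.array([[1.0, 2.0, 0.5],   # theory 1 ranks action 1 highest
              [0.0, 0.0, 0.0]])  # theory 2 is indifferent
credences = np.array([0.7, 0.3])
print(select_action(q, credences, explore_eps=0.1, rng=rng))  # -> 1
```

In the usage example, the indifferent theory's normalized preferences would divide by zero without the ε term; with it, that theory simply contributes nothing to the vote, and the credence-weighted aggregate follows the theory that actually expresses a preference.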