How RL Agents Behave When Their Actions Are Modified
Authors: Eric D. Langlois, Tom Everitt
Venue: AAAI 2021, pp. 11586-11594
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we experimentally evaluate the learning algorithms and demonstrate behaviour that is consistent with the theoretical results. |
| Researcher Affiliation | Collaboration | Eric D. Langlois (1,2,3), Tom Everitt (1); 1: DeepMind, 2: University of Toronto, 3: Vector Institute; edl@cs.toronto.edu, tomeveritt@google.com |
| Pseudocode | Yes | Algorithm 1 Q-Learning on a MAMDP; Algorithm 2 Virtual Sarsa on a MAMDP; Algorithm 3 Empirical Sarsa on a MAMDP (a hedged sketch of Algorithm 1 follows the table) |
| Open Source Code | Yes | Code available at https://github.com/edlanglois/mamdp |
| Open Datasets | No | The paper describes custom environments ("Simulation-Oversight", "Off-Switch", "Whisky-Gold") but does not provide access information (link, DOI, formal citation) for any publicly available datasets used or generated. |
| Dataset Splits | No | The paper mentions training steps and independent runs but does not specify train/validation/test splits or cross-validation setup for any dataset. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory amounts, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) needed to replicate the experiments. |
| Experiment Setup | No | The paper describes the environment specifics (e.g., states, rewards) and general training parameters (e.g., "trained to convergence", "10^7 steps") but does not provide concrete hyperparameter values (e.g., learning rate, batch size, optimizer settings) or other specific system-level training configurations in the main text. |
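The paper presents its learning algorithms only as pseudocode; the following is a minimal, hypothetical Python sketch of tabular Q-learning in a MAMDP-style environment, where the environment may overwrite the agent's selected action (e.g., an off-switch or human override). The `env` interface here (`reset()`, `step()` returning the executed action, and an `actions` list) is an assumption for illustration, not the authors' API; their actual implementation is in the linked repository, and the paper's Algorithm 1 should be consulted for the exact update rule.

```python
import random
from collections import defaultdict

def q_learning_mamdp(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning in an environment that may modify the agent's action.

    Assumed (hypothetical) interface:
      env.reset() -> state
      env.step(action) -> (executed_action, next_state, reward, done)
      env.actions: list of discrete actions
    `executed_action` may differ from `action` because of the modification scheme.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    def greedy(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration over the agent's *selected* action
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = greedy(state)

            # the environment may overwrite the selected action
            executed, next_state, reward, done = env.step(action)

            # standard Q-learning target: bootstrap with the max over next actions;
            # here the update is applied to the action actually executed,
            # one plausible reading of Algorithm 1 in the paper
            if done:
                target = reward
            else:
                target = reward + gamma * max(Q[(next_state, a)] for a in env.actions)
            Q[(state, executed)] += alpha * (target - Q[(state, executed)])

            state = next_state
    return Q
```

Per the paper's framing, the two Sarsa variants would differ only in the bootstrap term: Virtual Sarsa bootstraps on the action the policy would select, while Empirical Sarsa bootstraps on the action actually executed at the next step, which is why only the latter accounts for action modification.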