Avoiding Side Effects By Considering Future Tasks
Authors: Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then demonstrate the following on gridworld environments (Section 6): 1. Reversibility reward fails to avoid side effects if the current task requires irreversible actions. 2. Future task reward without a baseline policy shows interference behavior in a dynamic environment. 3. Future task reward with a baseline policy successfully avoids side effects and interference. [...] Table 1: Results on the gridworld environments... |
| Researcher Affiliation | Industry | Victoria Krakovna (DeepMind), Laurent Orseau (DeepMind), Richard Ngo (DeepMind), Miljan Martic (DeepMind), Shane Legg (DeepMind) |
| Pseudocode | Yes | Algorithm 1: Basic future task approach [...] Algorithm 2: Future task approach with a baseline. (A hedged sketch of both algorithms appears after the table.) |
| Open Source Code | Yes | Code: github.com/deepmind/deepmind-research/tree/master/side_effects_penalties |
| Open Datasets | No | The paper describes custom 'gridworld environments' (Sushi, Vase, Box, Soko-coin) but does not provide concrete access information (link, DOI, formal citation) for these as publicly available datasets. |
| Dataset Splits | No | The paper describes using a grid search for hyperparameters and sampling goal states, but it does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the custom gridworld environments used in the experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. It only mentions the runtime of the UVFA agent. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | We approximate the future task auxiliary reward using a sample of 10 possible future tasks. We approximate the baseline policy by sampling from the agent's experience of the outcome of the noop action... The UVFA network computes the value function given a goal state... It consists of two sub-networks... The two networks have one hidden layer of size 30 and an output layer of size 5. This configuration was chosen using a hyperparameter search... Exact and UVFA agents have discount rates of 0.99 and 0.95 respectively. For each agent, we do a grid search over the scaling parameter β (0.3, 1, 3, 10, 30, 100, 300, 1000)... We anneal the exploration rate linearly from 1 to 0, and keep it at 0 for the last 1000 episodes. We run the agents for 50K episodes in all environments except Soko-coin, where we run exact agents for 1M episodes and UVFA agents for 100K episodes. (A hedged reconstruction of this configuration appears after the table.) |
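
The pseudocode row above refers to the paper's Algorithm 1 (basic future task approach) and Algorithm 2 (future task approach with a baseline). The sketch below is only an illustration of what those procedures compute, assuming hypothetical helpers `goal_sampler()` (draws a possible future goal state) and `value_of_goal(state, goal)` (estimates how achievable `goal` is from `state`, exactly or via the UVFA); the function names and the way the auxiliary term is combined with the task reward are assumptions, not the authors' released code.

```python
# Minimal sketch of the future-task auxiliary reward, under the assumptions
# stated above. Not the authors' implementation.

N_FUTURE_TASKS = 10  # the paper samples 10 possible future tasks
BETA = 3.0           # scaling parameter beta; the paper grid-searches this value


def future_task_aux_reward(state, goal_sampler, value_of_goal):
    """Algorithm 1 (basic): average achievability of sampled future goals from `state`."""
    goals = [goal_sampler() for _ in range(N_FUTURE_TASKS)]
    return sum(value_of_goal(state, g) for g in goals) / len(goals)


def future_task_aux_reward_with_baseline(state, baseline_state, goal_sampler, value_of_goal):
    """Algorithm 2 (with baseline): compare achievability against the state the
    noop baseline policy would have reached, so the agent is neither penalized
    for environment dynamics nor rewarded for interfering with them."""
    goals = [goal_sampler() for _ in range(N_FUTURE_TASKS)]
    diffs = [value_of_goal(state, g) - value_of_goal(baseline_state, g) for g in goals]
    return sum(diffs) / len(diffs)


def shaped_reward(task_reward, aux_reward):
    # Task reward plus the beta-scaled auxiliary term (composition assumed here).
    return task_reward + BETA * aux_reward
```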
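The experiment-setup row quotes concrete training settings. The sketch below simply collects those reported values into one place and adds a linear exploration-rate schedule matching the description ("anneal the exploration rate linearly from 1 to 0, and keep it at 0 for the last 1000 episodes"); the field names and the `exploration_rate` helper are mine, the numeric values are the ones quoted from the paper.

```python
# Hedged reconstruction of the reported training configuration; structure is
# illustrative, values are as stated in the paper's setup description.

from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    n_future_tasks: int = 10        # sampled future tasks for the auxiliary reward
    hidden_size: int = 30           # one hidden layer per UVFA sub-network
    output_size: int = 5            # output layer of size 5, as reported
    discount_exact: float = 0.99    # discount rate for exact agents
    discount_uvfa: float = 0.95     # discount rate for UVFA agents
    beta_grid: tuple = (0.3, 1, 3, 10, 30, 100, 300, 1000)  # grid search over beta
    episodes: int = 50_000          # 1M (exact) / 100K (UVFA) for Soko-coin
    final_greedy_episodes: int = 1_000  # exploration held at 0 for the last 1000 episodes


def exploration_rate(episode: int, cfg: ExperimentConfig) -> float:
    """Epsilon annealed linearly from 1 to 0, then held at 0 for the final episodes."""
    anneal_span = cfg.episodes - cfg.final_greedy_episodes
    if episode >= anneal_span:
        return 0.0
    return 1.0 - episode / anneal_span
```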