Avoiding Side Effects By Considering Future Tasks

Authors: Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then demonstrate the following on gridworld environments (Section 6): 1. Reversibility reward fails to avoid side effects if the current task requires irreversible actions. 2. Future task reward without a baseline policy shows interference behavior in a dynamic environment. 3. Future task reward with a baseline policy successfully avoids side effects and interference. [...] Table 1: Results on the gridworld environments...
Researcher Affiliation | Industry | Victoria Krakovna (DeepMind), Laurent Orseau (DeepMind), Richard Ngo (DeepMind), Miljan Martic (DeepMind), Shane Legg (DeepMind)
Pseudocode | Yes | Algorithm 1: Basic future task approach [...] Algorithm 2: Future task approach with a baseline
Open Source Code | Yes | Code: github.com/deepmind/deepmind-research/tree/master/side_effects_penalties
Open Datasets | No | The paper describes custom 'gridworld environments' (Sushi, Vase, Box, Soko-coin) but does not provide concrete access information (link, DOI, formal citation) for these as publicly available datasets.
Dataset Splits | No | The paper describes using a grid search for hyperparameters and sampling goal states, but it does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the custom gridworld environments used in the experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments; it only mentions the runtime of the UVFA agent.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | We approximate the future task auxiliary reward using a sample of 10 possible future tasks. We approximate the baseline policy by sampling from the agent's experience of the outcome of the noop action... The UVFA network computes the value function given a goal state... It consists of two sub-networks... The two networks have one hidden layer of size 30 and an output layer of size 5. This configuration was chosen using a hyperparameter search... Exact and UVFA agents have discount rates of 0.99 and 0.95 respectively. For each agent, we do a grid search over the scaling parameter β (0.3, 1, 3, 10, 30, 100, 300, 1000)... We anneal the exploration rate linearly from 1 to 0, and keep it at 0 for the last 1000 episodes. We run the agents for 50K episodes in all environments except Soko-coin, where we run exact agents for 1M episodes and UVFA agents for 100K episodes.
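
The Pseudocode and Experiment Setup rows above describe the core idea: the agent's task reward is augmented with a term estimating how well it could still complete a sample of 10 possible future tasks, compared against a baseline (noop) policy and scaled by β. The snippet below is a minimal sketch of that idea, not the paper's Algorithm 2 itself; the goal_values estimator, goal_sampler, and baseline_state argument are hypothetical placeholders standing in for the learned value functions and the noop-outcome baseline quoted above.

def future_task_auxiliary_reward(state, baseline_state, goal_values, goal_sampler,
                                 num_samples=10):
    """Sketch of the future-task auxiliary term with a baseline.

    goal_values(goal, state) is assumed to estimate how well future task `goal`
    can be completed from `state` (played by learned value functions in the
    paper); baseline_state stands in for the outcome of the noop baseline
    policy, which the paper approximates from the agent's experience of the
    noop action.
    """
    goals = [goal_sampler() for _ in range(num_samples)]  # 10 sampled future tasks
    diffs = [goal_values(g, state) - goal_values(g, baseline_state) for g in goals]
    return sum(diffs) / num_samples


def shaped_reward(task_reward, state, baseline_state, goal_values, goal_sampler,
                  beta=1.0):
    """Current-task reward plus the beta-scaled future-task term; beta is the
    scaling parameter the paper tunes by grid search (0.3 to 1000)."""
    return task_reward + beta * future_task_auxiliary_reward(
        state, baseline_state, goal_values, goal_sampler)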
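
The Experiment Setup row also pins down most of the training configuration. Below is a compact sketch of those reported settings; only the numbers quoted above come from the paper, while the dictionary keys and the epsilon_schedule helper are illustrative names, not identifiers from the released code.

# Reported settings from the Experiment Setup row; names are illustrative.
TRAINING_CONFIG = {
    "num_future_task_samples": 10,
    "discount_exact": 0.99,
    "discount_uvfa": 0.95,
    "uvfa_hidden_layer_size": 30,   # one hidden layer per sub-network
    "uvfa_output_layer_size": 5,
    "beta_grid": [0.3, 1, 3, 10, 30, 100, 300, 1000],
    "episodes_default": 50_000,
    "episodes_soko_coin_exact": 1_000_000,
    "episodes_soko_coin_uvfa": 100_000,
}


def epsilon_schedule(episode, total_episodes, final_greedy_episodes=1000):
    """Exploration rate annealed linearly from 1 to 0 and held at 0 for the
    last `final_greedy_episodes` episodes, as described above."""
    anneal_episodes = max(total_episodes - final_greedy_episodes, 1)
    if episode >= anneal_episodes:
        return 0.0
    return 1.0 - episode / anneal_episodes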