Fast Adaptation to New Environments via Policy-Dynamics Value Functions

Authors: Roberta Raileanu, Max Goldstein, Arthur Szlam, Rob Fergus

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate PD-VF on four continuous control domains, and compare it with an upper bound, four baselines, and four ablations. For each domain, we create a number of environments with different dynamics. Then, we split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics. For all our experiments, we show the mean and standard deviation of the average return (over 100 episodes) across 5 different seeds of each model.
Researcher Affiliation | Collaboration | Roberta Raileanu¹, Max Goldstein¹, Arthur Szlam², Rob Fergus¹. ¹Department of Computer Science, New York University, New York, USA. ²Facebook AI Research, New York, USA.
Pseudocode | No | The paper describes the four phases of PD-VF conceptually but does not provide pseudocode or an algorithm block.
Open Source Code | Yes | Code available at policy-dynamics-value-functions.
Open Datasets | Yes | Swimmer is a family of environments with varying dynamics based on MuJoCo's Swimmer-v3 domain (Todorov et al., 2012). ... Ant-wind is a family of environments based on MuJoCo's Ant-v3 domain...
Dataset Splits | Yes | We split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics. ... The 5 samples in the range [3/4 · 2π, ..., 2π] are held out as evaluation environments, the rest being used for training. ... There are four test environments with both the leg and ankle lengths being either short or long.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for experiments (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) and Adam (Kingma & Ba, 2014) but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | The dynamics embeddings are inferred using at most N_d = 4 interactions with the environment. ... For a given environment, all methods use the same number of steps N_d (at the beginning of each episode) to infer the embedding of the environment dynamics.
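
The reporting protocol quoted in the Research Type row (average return over 100 episodes, then mean and standard deviation across 5 seeds) can be sketched as below. This is an illustrative sketch rather than the authors' evaluation code: the function name `summarize_returns` is hypothetical, the episode returns are random placeholders and not results from the paper, and the use of the population standard deviation is an assumption, since the summary does not say which estimator was used.

```python
import random
from statistics import fmean, pstdev

NUM_SEEDS = 5        # from the summary: 5 different seeds per model
NUM_EPISODES = 100   # from the summary: average return over 100 episodes


def summarize_returns(returns_per_seed):
    """Return (mean, std) across seeds of the per-seed average episode return.

    `returns_per_seed` is a list with one inner list of episode returns per seed.
    Hypothetical helper; the population standard deviation is an assumption.
    """
    if not returns_per_seed or any(len(r) == 0 for r in returns_per_seed):
        raise ValueError("need at least one seed and one episode per seed")
    # Step 1: average the episode returns within each seed.
    per_seed_average = [fmean(episode_returns) for episode_returns in returns_per_seed]
    # Step 2: aggregate the per-seed averages across seeds.
    return fmean(per_seed_average), pstdev(per_seed_average)


# Placeholder data only: random numbers standing in for real episode returns.
rng = random.Random(0)
placeholder_returns = [
    [rng.gauss(0.0, 1.0) for _ in range(NUM_EPISODES)]
    for _ in range(NUM_SEEDS)
]
mean_return, std_return = summarize_returns(placeholder_returns)
print(f"{mean_return:.3f} +/- {std_return:.3f}")
```

Averaging within each seed before aggregating keeps the reported spread a measure of seed-to-seed variation rather than episode-to-episode noise, which matches the wording of the quoted sentence.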