Fast Adaptation to New Environments via Policy-Dynamics Value Functions
Authors: Roberta Raileanu, Max Goldstein, Arthur Szlam, Rob Fergus
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PD-VF on four continuous control domains, and compare it with an upper bound, four baselines, and four ablations. For each domain, we create a number of environments with different dynamics. Then, we split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics. For all our experiments, we show the mean and standard deviation of the average return (over 100 episodes) across 5 different seeds of each model. |
| Researcher Affiliation | Collaboration | Roberta Raileanu¹, Max Goldstein¹, Arthur Szlam², Rob Fergus¹. ¹Department of Computer Science, New York University, New York, USA; ²Facebook AI Research, New York, USA. |
| Pseudocode | No | The paper describes the four phases of PD-VF conceptually but does not provide pseudocode or an algorithm block. |
| Open Source Code | Yes | Code available at policy-dynamics-value-functions. |
| Open Datasets | Yes | Swimmer is a family of environments with varying dynamics based on MuJoCo's Swimmer-v3 domain (Todorov et al., 2012). ... Ant-wind is a family of environments based on MuJoCo's Ant-v3 domain... |
| Dataset Splits | Yes | We split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics. ... The 5 samples in the range [3/4 · 2π, ..., 2π] are held out as evaluation environments, the rest being used for training. ... There are four test environments with both the leg and ankle lengths being either short or long. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) and Adam (Kingma & Ba, 2014) but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The dynamics embeddings are inferred using at most N_d = 4 interactions with the environment. ... For a given environment, all methods use the same number of steps N_d (at the beginning of each episode) to infer the embedding of the environment dynamics. |
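
The evaluation protocol quoted in the table (hold out the environments whose dynamics parameter falls in the last quarter of the range, then report the mean and standard deviation across seeds of the per-seed average return) could be sketched roughly as below. This is an illustration under stated assumptions, not the authors' code: the function names `split_environments` and `aggregate_returns`, the choice of 20 evenly spaced environments, and the use of the population standard deviation are all hypothetical, and the returns are placeholder values rather than results from the paper.

```python
import math
import statistics


def split_environments(num_envs, held_out_start=0.75):
    """Split hypothetical environments into train and test sets.

    Assumption: environment k has a dynamics angle 2*pi*k/num_envs, and
    every environment whose angle is at least held_out_start * 2*pi is
    held out for evaluation.
    """
    train, test = [], []
    for k in range(num_envs):
        angle = 2 * math.pi * k / num_envs
        # Compare on the index fraction so the boundary case is exact.
        if k / num_envs >= held_out_start:
            test.append(angle)
        else:
            train.append(angle)
    return train, test


def aggregate_returns(returns_per_seed):
    """Return (mean, std) across seeds of the per-seed average return.

    returns_per_seed holds one list of episode returns per seed.
    Assumption: the population standard deviation is used; the source
    does not say which estimator the authors chose.
    """
    per_seed_means = [statistics.fmean(episodes) for episodes in returns_per_seed]
    return statistics.fmean(per_seed_means), statistics.pstdev(per_seed_means)


# Placeholder data: 2 seeds with 100 episodes each (not real results).
train_envs, test_envs = split_environments(20)
mean_return, std_return = aggregate_returns([[1.0] * 100, [3.0] * 100])
print(len(train_envs), len(test_envs), mean_return, std_return)
```

With 20 evenly spaced angles, this split holds out 5 environments, matching the count quoted in the Dataset Splits row. The actual number and spacing of environments in each domain should be taken from the paper.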