Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Fast Adaptation to New Environments via Policy-Dynamics Value Functions
Authors: Roberta Raileanu, Max Goldstein, Arthur Szlam, Rob Fergus
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PD-VF on four continuous control domains, and compare it with an upper bound, four baselines, and four ablations. For each domain, we create a number of environments with different dynamics. Then, we split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics. For all our experiments, we show the mean and standard deviation of the average return (over 100 episodes) across 5 different seeds of each model. |
| Researcher Affiliation | Collaboration | Roberta Raileanu¹, Max Goldstein¹, Arthur Szlam², Rob Fergus¹. ¹Department of Computer Science, New York University, New York, USA. ²Facebook AI Research, New York, USA. |
| Pseudocode | No | The paper describes the four phases of PD-VF conceptually but does not provide pseudocode or an algorithm block. |
| Open Source Code | Yes | Code available at policy-dynamics-value-functions. |
| Open Datasets | Yes | Swimmer is a family of environments with varying dynamics based on MuJoCo's Swimmer-v3 domain (Todorov et al., 2012). ... Ant-wind is a family of environments based on MuJoCo's Ant-v3 domain... |
| Dataset Splits | Yes | We split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics. ... The 5 samples in the range [3/4 · 2π, ..., 2π] are held out as evaluation environments, the rest being used for training. ... There are four test environments with both the leg and ankle lengths being either short or long. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) and Adam (Kingma & Ba, 2014) but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The dynamics embeddings are inferred using at most N_d = 4 interactions with the environment. ... For a given environment, all methods use the same number of steps N_d (at the beginning of each episode) to infer the embedding of the environment dynamics. |
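
The reporting protocol quoted in the Research Type row (mean and standard deviation of the average return over 100 episodes, across 5 seeds) could be sketched roughly as follows. This is only an illustration, not the authors' evaluation code: the function names, the dummy episode generator, and the choice of population standard deviation are all assumptions, since the quoted text does not specify them.

```python
import random
import statistics


def average_return(run_episode, n_episodes=100):
    """Average return of one trained model over n_episodes evaluation episodes.

    run_episode is a hypothetical callable that returns a single episode return.
    """
    returns = [run_episode() for _ in range(n_episodes)]
    return statistics.fmean(returns)


def summarize_across_seeds(per_seed_averages):
    """Mean and standard deviation of the per-seed average returns.

    Assumption: population standard deviation; the quoted text does not say
    whether the population or sample estimator was used.
    """
    mean = statistics.fmean(per_seed_averages)
    std = statistics.pstdev(per_seed_averages)
    return mean, std


def make_dummy_episode(seed):
    """Stand-in for evaluating a model trained with the given seed (hypothetical)."""
    rng = random.Random(seed)
    return lambda: rng.gauss(100.0, 10.0)


# One average return per seed (5 seeds), each computed over 100 episodes.
per_seed = [average_return(make_dummy_episode(seed)) for seed in range(5)]
mean_return, std_return = summarize_across_seeds(per_seed)
print(f"average return: {mean_return:.1f} +/- {std_return:.1f}")
```

In a real evaluation, `make_dummy_episode` would be replaced by a rollout of the trained policy in a held-out test environment.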