Environment Probing Interaction Policies
Authors: Wenxuan Zhou, Lerrel Pinto, Abhinav Gupta
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments. |
| Researcher Affiliation | Collaboration | Wenxuan Zhou (1), Lerrel Pinto (1), Abhinav Gupta (1, 2); (1) The Robotics Institute, Carnegie Mellon University; (2) Facebook AI Research |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found. |
| Open Source Code | Yes | Code is available at https://github.com/Wenxuan-Zhou/EPI. |
| Open Datasets | Yes | For this, we use the Striker and the Hopper MuJoCo (Todorov et al., 2012) environments from OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | Yes | To train our prediction models, a dataset of transition data (s_t, a_t, s_{t+1}) is collected in the training environments using a pre-trained task policy (Sec. 4.1.3). This data is split into a training set and a validation set. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or cloud instance types) used for experiments were mentioned. |
| Software Dependencies | No | The paper mentions optimization with Adam and TRPO using the rllab implementation, but does not provide version numbers for these software components. |
| Experiment Setup | Yes (see the sketch after this table) | An EPI-trajectory contains 10 steps of observations and actions for both Hopper and Striker. The embedding network ψ ... has two fully connected layers with 32 neurons each... The prediction models ... have four fully connected layers with 128 neurons each... The EPI-policy is trained for 200-400 iterations in total with a batch size of 10000 timesteps. The task policy then uses the trained EPI-policy and the embedding network to update for 1000 iterations with a batch size of 100000 timesteps. |
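
The architecture and data-handling details quoted above can be summarized in a short sketch. The snippet below is a minimal, illustrative reconstruction in PyTorch of the layer sizes stated in the Experiment Setup row and the train/validation split mentioned in the Dataset Splits row. The class names, embedding dimension, activation functions, split ratio, and the choice of framework are assumptions for illustration, not the authors' implementation; the released code at https://github.com/Wenxuan-Zhou/EPI is the authoritative reference.

```python
import torch
import torch.nn as nn

# Sketch of the quoted architecture sizes. Input/output dimensions, the
# embedding size, ReLU activations, and how the output layer is counted
# are all assumptions.

class EmbeddingNet(nn.Module):
    """Embedding network psi: two fully connected layers with 32 neurons each."""
    def __init__(self, traj_dim: int, embed_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, epi_trajectory: torch.Tensor) -> torch.Tensor:
        # epi_trajectory: flattened 10-step sequence of observations and actions
        return self.net(epi_trajectory)


class PredictionNet(nn.Module):
    """Prediction model: four fully connected layers with 128 neurons each."""
    def __init__(self, obs_dim: int, act_dim: int, embed_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, obs_dim),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Predict the next state s_{t+1} from (s_t, a_t) and the environment embedding z.
        return self.net(torch.cat([s, a, z], dim=-1))


def split_transitions(transitions, val_fraction=0.2):
    """Split collected (s_t, a_t, s_{t+1}) tuples into training and validation sets.

    The 0.2 validation fraction is an assumption; the paper only states
    that the transition data is split into a training and a validation set.
    """
    n_val = int(len(transitions) * val_fraction)
    return transitions[n_val:], transitions[:n_val]
```

The split between an embedding network and a separate prediction model mirrors the paper's setup, in which prediction accuracy on held-out transitions is used to evaluate how informative the environment embedding is; hyperparameters beyond the quoted layer sizes are not specified in the table above.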