Environment Probing Interaction Policies

Authors: Wenxuan Zhou, Lerrel Pinto, Abhinav Gupta

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally show that EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments.
Researcher Affiliation | Collaboration | Wenxuan Zhou (1), Lerrel Pinto (1), Abhinav Gupta (1,2); (1) The Robotics Institute, Carnegie Mellon University; (2) Facebook AI Research
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found.
Open Source Code | Yes | Code is available at https://github.com/Wenxuan-Zhou/EPI.
Open Datasets | Yes | For this, we use the Striker and the Hopper MuJoCo (Todorov et al., 2012) environments from OpenAI Gym (Brockman et al., 2016).
Dataset Splits | Yes | To train our prediction models, a dataset of transition data (s_t, a_t, s_{t+1}) is collected in the training environments using a pre-trained task policy (Sec. 4.1.3). This data is split into a training set and a validation set. (See the data-collection and split sketch after the table.)
Hardware Specification | No | No specific hardware details (like GPU/CPU models or cloud instance types) used for experiments were mentioned.
Software Dependencies | No | The paper mentions optimization with Adam and TRPO via the rllab implementation, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | An EPI-trajectory contains 10 steps of observations and actions for both Hopper and Striker. The embedding network ψ ... has two fully connected layers with 32 neurons each... The prediction models ... have four fully connected layers with 128 neurons each... The EPI-policy is trained for 200-400 iterations in total with a batch size of 10,000 timesteps. The task policy then uses the trained EPI-policy and the embedding network and is updated for 1,000 iterations with a batch size of 100,000 timesteps. (The reported network sizes and training schedule are sketched below.)
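
To make the Dataset Splits row concrete, here is a minimal sketch of the collection-and-split step, assuming the classic Gym API (pre-0.26 reset/step signatures) and illustrative environment IDs such as "Hopper-v2"; the rollout length, the 90/10 split ratio, and all function names are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the step described in the Dataset Splits row: roll out a
# pre-trained task policy in a training environment, record (s_t, a_t, s_{t+1})
# transitions, and hold out a validation portion. Environment IDs, rollout
# length, and the split ratio are assumptions for illustration.
import gym
import numpy as np

def collect_transitions(env_name, policy, n_steps=10_000):
    """Collect (obs, action, next_obs) tuples with a given policy."""
    env = gym.make(env_name)          # e.g. "Hopper-v2" or "Striker-v2" (assumed IDs)
    obs = env.reset()
    transitions = []
    for _ in range(n_steps):
        act = policy(obs)
        next_obs, _, done, _ = env.step(act)
        transitions.append((obs, act, next_obs))
        obs = env.reset() if done else next_obs
    return transitions

def train_val_split(transitions, val_fraction=0.1, seed=0):
    """Randomly split transitions into a training set and a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    n_val = int(len(transitions) * val_fraction)
    val = [transitions[i] for i in idx[:n_val]]
    train = [transitions[i] for i in idx[n_val:]]
    return train, val
```

In the paper the data comes from multiple training environments; the sketch shows a single environment for brevity.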
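
The network sizes quoted in the Experiment Setup row can be read as the following minimal PyTorch sketch. Only the layer counts and widths (two 32-unit layers for the embedding network ψ, four 128-unit layers for the prediction models) come from the quoted text; the input and output dimensions, ReLU activations, output heads, and the embedding size are assumptions.

```python
# Minimal PyTorch sketch of the layer sizes quoted in the Experiment Setup row.
# Activations, output heads, and all dimensions other than the quoted layer
# widths are assumptions for illustration.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Embedding network psi: two fully connected layers with 32 neurons each."""
    def __init__(self, epi_traj_dim, embedding_dim=8):    # embedding_dim is assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(epi_traj_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, embedding_dim),                  # assumed output head
        )

    def forward(self, epi_trajectory):
        return self.net(epi_trajectory)

class PredictionModel(nn.Module):
    """Prediction model: four fully connected layers with 128 neurons each."""
    def __init__(self, obs_dim, act_dim, embedding_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, obs_dim),                       # assumed next-state head
        )

    def forward(self, obs, act, embedding):
        return self.net(torch.cat([obs, act, embedding], dim=-1))
```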
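
Finally, the training schedule reported in the same row, gathered into one illustrative dictionary; the key names are assumptions, and only the values repeat the quoted numbers.

```python
# Training-schedule numbers quoted in the Experiment Setup row, collected into
# an illustrative config dict. Key names are assumptions; the values repeat
# the quoted iteration counts and batch sizes (in timesteps).
EPI_TRAINING_CONFIG = {
    "epi_trajectory_length": 10,            # observation/action steps per EPI-trajectory
    "epi_policy_iterations": (200, 400),    # quoted range of training iterations
    "epi_policy_batch_timesteps": 10_000,
    "task_policy_iterations": 1_000,
    "task_policy_batch_timesteps": 100_000,
}
```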