Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Environment Probing Interaction Policies
Authors: Wenxuan Zhou, Lerrel Pinto, Abhinav Gupta
ICLR 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments. |
| Researcher Affiliation | Collaboration | Wenxuan Zhou¹, Lerrel Pinto¹, Abhinav Gupta¹,² ¹The Robotics Institute, Carnegie Mellon University ²Facebook AI Research |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found. |
| Open Source Code | Yes | Code is available at https://github.com/Wenxuan-Zhou/EPI. |
| Open Datasets | Yes | For this, we use the Striker and the Hopper MuJoCo (Todorov et al., 2012) environments from OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | Yes | To train our prediction models, a dataset of transition data (s_t, a_t, s_{t+1}) is collected in the training environments using a pre-trained task policy (Sec. 4.1.3). This data is split into a training set and a validation set. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or cloud instance types) used for experiments were mentioned. |
| Software Dependencies | No | The paper mentions optimization by Adam and TRPO with rllab implementation, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | An EPI-trajectory contains 10 steps of observations and actions for both Hopper and Striker. The embedding network ψ ... has two fully connected layers with 32 neurons each... The prediction models ... has four fully connected layers with 128 neurons each... The EPI-policy is trained for 200 400 iterations in total with a batch size of 10000 timesteps. The task policy will then use the trained EPI-policy and the embedding network to update for 1000 iterations with a batch size of 100000 timesteps. |
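
As a rough illustration of the network sizes quoted in the Experiment Setup row, the sketch below lists the weight shapes of the embedding network and the prediction model and counts their parameters. Only the 10-step trajectory length and the hidden layer widths (2 x 32 and 4 x 128) come from the quoted text. The observation, action, and embedding dimensions, the extra linear output layer, and the choice to feed the flattened trajectory to the embedding network are assumptions made for this example, and the helper names are hypothetical.

```python
# Sketch only: sizes marked ASSUMED are not stated in the scorecard.
EPI_STEPS = 10          # from the quoted setup: 10 steps per EPI-trajectory
OBS_DIM = 11            # ASSUMED observation size
ACT_DIM = 3             # ASSUMED action size
EMBED_DIM = 8           # ASSUMED embedding size


def mlp_layer_shapes(input_dim, hidden_sizes, output_dim):
    """Return (fan_in, fan_out) pairs for a fully connected network.

    ASSUMED: the quoted hidden layers are followed by one linear output layer.
    """
    sizes = [input_dim, *hidden_sizes, output_dim]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(shapes):
    """Count weights plus biases over all layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


# Embedding network: flattened EPI-trajectory -> embedding (two hidden layers of 32).
embedding_shapes = mlp_layer_shapes(
    EPI_STEPS * (OBS_DIM + ACT_DIM), [32, 32], EMBED_DIM
)

# Prediction model: (state, action, embedding) -> next state (four hidden layers of 128).
prediction_shapes = mlp_layer_shapes(
    OBS_DIM + ACT_DIM + EMBED_DIM, [128] * 4, OBS_DIM
)

if __name__ == "__main__":
    print("embedding layers:", embedding_shapes)
    print("embedding parameters:", count_parameters(embedding_shapes))
    print("prediction layers:", prediction_shapes)
    print("prediction parameters:", count_parameters(prediction_shapes))
```

Changing the assumed dimensions only affects the first and last layer of each network; the hidden widths stay as quoted.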