Learning Self-Imitating Diverse Policies
Authors: Tanmay Gangwani, Qiang Liu, Jian Peng
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that our algorithm works comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. |
| Researcher Affiliation | Academia | Tanmay Gangwani, Dept. of Computer Science, UIUC, gangwan2@uiuc.edu; Qiang Liu, Dept. of Computer Science, UT Austin, lqiang@cs.utexas.edu; Jian Peng, Dept. of Computer Science, UIUC, jianpeng@uiuc.edu |
| Pseudocode | Yes | Our full algorithm is summarized in Appendix 5.3 (Algorithm 2). Algorithm 1 notation: θ = policy parameters, φ = discriminator parameters, r(s, a) = environment reward. Algorithm 2 notation: θi = policy parameters for rank i, φi = self-imitation discriminator parameters for rank i, ψi = empirical density network parameters for rank i. A hedged sketch of the discriminator-based reward shaping this notation suggests is given after the table. |
| Open Source Code | No | The paper mentions using code provided by the authors of other papers (e.g., 'We use the code provided by the authors (https://github.com/junhyukoh/self-imitation-learning)') but does not provide a link to, or an explicit statement about, a release of its own code. |
| Open Datasets | Yes | We benchmark high-dimensional, continuous-control locomotion tasks based on the MuJoCo physics simulator by extending the OpenAI Baselines (Dhariwal et al., 2017) framework. |
| Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits with percentages, sample counts, or citations to predefined splits. It discusses interaction with environments over timesteps and ablation studies, but not explicit dataset splitting. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'OpenAI Baselines (Dhariwal et al., 2017) framework' and the 'PPO algorithm (Schulman et al., 2017b)' but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | Further implementation details are in the Appendix. All runs use 5M timesteps of interaction with the environment. Horizon (T) = 1000 (locomotion), 250 (Maze), 5000 (Swimming+Gathering); discount (γ) = 0.99; GAE parameter (λ) = 0.95; PPO internal epochs = 5; PPO learning rate = 1e-4; PPO mini-batch = 64. These settings are collected in a configuration sketch below the table. |
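
For quick reference, the settings reported in the Experiment Setup row can be gathered into a single configuration block. This is a minimal sketch in Python: the dictionary layout and key names are editorial choices, and only the numeric values come from the paper as quoted above.

```python
# Experiment settings as reported in the paper's setup (see the table above).
# Only the numeric values are taken from the paper; the key names and the
# per-task horizon mapping are illustrative assumptions.
EXPERIMENT_CONFIG = {
    "total_timesteps": 5_000_000,      # 5M environment steps per run
    "horizon_T": {
        "locomotion": 1000,
        "maze": 250,
        "swimming_gathering": 5000,
    },
    "discount_gamma": 0.99,            # discount factor
    "gae_lambda": 0.95,                # GAE parameter
    "ppo_epochs": 5,                   # internal PPO epochs per update
    "ppo_learning_rate": 1e-4,
    "ppo_minibatch_size": 64,
}
```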
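The notation recorded in the Pseudocode row (θ = policy parameters, φ = discriminator parameters, r(s, a) = environment reward) points to a GAIL-style self-imitation scheme, in which a discriminator presumably distinguishes the agent's own stored high-return transitions from fresh policy samples and supplies a shaping bonus on top of the environment reward. The sketch below illustrates that idea only; the network sizes, the -log(1 - D) bonus form, the mixing weight, and all function names are assumptions and do not reproduce the paper's Algorithms 1-2.

```python
# Hedged sketch of a GAIL-style self-imitation reward-shaping step, using the
# notation extracted in the table (theta = policy parameters, phi = discriminator
# parameters, r(s, a) = environment reward). Everything concrete below is an
# illustrative assumption, not the authors' implementation.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """D_phi(s, a): probability that (s, a) comes from the agent's stored
    high-return trajectories rather than from the current policy (assumed role)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return torch.sigmoid(self.net(torch.cat([obs, act], dim=-1)))


def discriminator_loss(disc, good_obs, good_act, pol_obs, pol_act):
    """Binary cross-entropy: label stored high-return transitions as 1 and
    fresh policy transitions as 0 (standard GAIL-style objective)."""
    good = disc(good_obs, good_act)
    pol = disc(pol_obs, pol_act)
    return -(torch.log(good + 1e-8).mean() + torch.log(1.0 - pol + 1e-8).mean())


def shaped_reward(env_reward, disc, obs, act, bonus_weight=0.1):
    """r'(s, a) = r(s, a) + w * self-imitation bonus.
    The -log(1 - D) bonus form and the weight are assumptions for illustration."""
    with torch.no_grad():
        bonus = -torch.log(1.0 - disc(obs, act) + 1e-8).squeeze(-1)
    return env_reward + bonus_weight * bonus
```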