Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning Self-Imitating Diverse Policies
Authors: Tanmay Gangwani, Qiang Liu, Jian Peng
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that our algorithm works comparable to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. |
| Researcher Affiliation | Academia | Tanmay Gangwani, Dept. of Computer Science, UIUC; Qiang Liu, Dept. of Computer Science, UT Austin; Jian Peng, Dept. of Computer Science, UIUC |
| Pseudocode | Yes | Our full algorithm is summarized in Appendix 5.3 (Algorithm 2). Algorithm 1 notation: θ = policy parameters; φ = discriminator parameters; r(s, a) = environment reward. Algorithm 2 notation: θi = policy parameters for rank i; φi = self-imitation discriminator parameters for rank i; ψi = empirical density network parameters for rank i. |
| Open Source Code | No | The paper mentions using code provided by authors of other papers (e.g., 'We use the code provided by the authors', with a footnote pointing to https://github.com/junhyukoh/self-imitation-learning) but does not provide a link or explicit statement for their own code release. |
| Open Datasets | Yes | We benchmark high-dimensional, continuous-control locomotion tasks based on the MuJoCo physics simulator by extending the OpenAI Baselines (Dhariwal et al., 2017) framework. |
| Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits with percentages, sample counts, or citations to predefined splits. It discusses interaction with environments over timesteps and ablation studies, but not explicit dataset splitting. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'OpenAI Baselines (Dhariwal et al., 2017) framework' and 'PPO algorithm (Schulman et al., 2017b)' but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | Further implementation details are in the Appendix. All runs use 5M timesteps of interaction with the environment. Horizon (T) = 1000 (locomotion), 250 (Maze), 5000 (Swimming+Gathering); Discount (γ) = 0.99; GAE parameter (λ) = 0.95; PPO internal epochs = 5; PPO learning rate = 1e-4; PPO mini-batch = 64. |
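
The hyperparameters in the Experiment Setup row can be collected into a small configuration object, which may help when trying to reproduce the runs. The following is a minimal sketch, not the authors' code: the class name `PPOSetup`, its field names, the task-family keys, and the helper `gae_advantages` are all hypothetical. The helper uses the standard generalized advantage estimation recursion for a single trajectory segment with no episode-termination handling, which is an assumption about how γ and λ enter the advantage computation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class PPOSetup:
    """Hyperparameters as reported in the table (field names are hypothetical)."""
    total_timesteps: int = 5_000_000   # "5M timesteps of interaction"
    horizon: Dict[str, int] = field(   # Horizon (T) per task family
        default_factory=lambda: {
            "locomotion": 1000,
            "maze": 250,
            "swimming_gathering": 5000,
        }
    )
    gamma: float = 0.99                # Discount (γ)
    gae_lambda: float = 0.95           # GAE parameter (λ)
    ppo_epochs: int = 5                # PPO internal epochs
    learning_rate: float = 1e-4        # PPO learning rate
    minibatch_size: int = 64           # PPO mini-batch


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float,
    lam: float,
) -> List[float]:
    """Standard GAE recursion over one segment (assumed, not taken from the paper).

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final reward.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


setup = PPOSetup()
example_advantages = gae_advantages(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    gamma=setup.gamma,
    lam=setup.gae_lambda,
)
```

Keeping the per-task horizon in a dictionary reflects that it is the only setting in the table that varies across environments; everything else is shared across runs.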