Inverse Reinforcement Learning with the Average Reward Criterion
Authors: Feiyang Wu, Jingyang Ke, Anqi Wu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we corroborate our analysis with numerical experiments using the MuJoCo benchmark and additional control tasks. Numerical experiments: our RL and IRL methodologies have been tested against the well-known robotics manipulation benchmark, MuJoCo, to substantiate our theoretical analysis. The results indicate that the proposed SPMD and IPMD algorithms generally outperform state-of-the-art algorithms. |
| Researcher Affiliation | Academia | Feiyang Wu (feiyangwu@gatech.edu), Jingyang Ke (jingyang.ke@gatech.edu), Anqi Wu (anqiwu@gatech.edu), School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332 |
| Pseudocode | Yes | Algorithm 1: The Stochastic Policy Mirror Descent (SPMD) algorithm for AMDPs; Algorithm 2: The Inverse Policy Mirror Descent (IPMD) algorithm (a generic mirror-descent update is sketched after this table). |
| Open Source Code | Yes | Our code can be found at https://anonymous.4open.science/r/IPMD-9D60. |
| Open Datasets | No | The paper uses the MuJoCo benchmark and environments (Hopper, HalfCheetah, Walker, Ant, Humanoid, Pendulum, LunarLanderContinuous). While these are standard, the paper does not provide concrete access information (link, DOI, formal citation) for the *datasets* used within these environments, nor specific details about how or where to obtain the exact data used for training and evaluation. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). It mentions training on environments but not data splitting. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments (e.g., specific GPU models, CPU models, or cloud computing instances with specifications). |
| Software Dependencies | No | The paper mentions "stable-baselines3 [28]" but does not provide specific version numbers for it or any other software dependencies. It also mentions "MuJoCo" and "OpenAI Gym" without versions. |
| Experiment Setup | Yes | During training, we found that setting the entropy coefficient term to 0.01 makes training stable and efficient. The learning rate is 3e-4. Each step of the algorithm samples 512 state-action sample pairs. A double Q-learning technique is used to minimize overestimation [13]. (A hedged configuration sketch follows the table.) |
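
The SPMD and IPMD pseudocode noted above both revolve around a mirror-descent policy update. As a point of reference only, below is a minimal tabular sketch of a generic KL-regularized policy mirror descent step; it is not the paper's exact SPMD/IPMD procedure (which targets average-reward MDPs with stochastic estimates), and the step size `eta` and the Q-table here are placeholders.

```python
# Minimal sketch of a KL-regularized policy mirror descent step (tabular case).
# Illustrative only: not the paper's exact SPMD/IPMD update.
import numpy as np

def pmd_step(policy: np.ndarray, q_values: np.ndarray, eta: float) -> np.ndarray:
    """One mirror-descent update: pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q_k(s, a)).

    policy:   (num_states, num_actions) row-stochastic matrix, current policy pi_k.
    q_values: (num_states, num_actions) estimated action values under pi_k.
    eta:      positive step size (placeholder value below).
    """
    logits = np.log(policy + 1e-12) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy

# Example: 3 states, 2 actions, uniform initial policy.
pi = np.full((3, 2), 0.5)
q = np.array([[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]])
pi = pmd_step(pi, q, eta=0.1)
```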
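
The experiment-setup details reported in the table (learning rate 3e-4, entropy coefficient 0.01, 512 sampled state-action pairs per step, double Q-learning) could plausibly be wired up as follows with stable-baselines3, which the paper cites. This is a hedged illustration, not the authors' released code: the choice of the SAC algorithm class, the `Hopper-v4` environment ID, and the training budget are assumptions, and the paper does not pin library versions.

```python
# Hedged reconstruction of the reported training configuration using
# stable-baselines3 SAC on a MuJoCo task. The authors' own code is linked in
# the repository above; algorithm class, env ID, and timesteps are assumptions.
import gymnasium as gym  # recent stable-baselines3 versions use Gymnasium;
                         # the paper only mentions OpenAI Gym without a version.
from stable_baselines3 import SAC

env = gym.make("Hopper-v4")  # one of the MuJoCo environments used in the paper

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,   # reported learning rate
    ent_coef=0.01,        # reported entropy coefficient
    batch_size=512,       # 512 state-action pairs sampled per step
    verbose=1,
)
# SAC's twin critics provide the double Q-learning mechanism the paper cites
# to reduce value overestimation.
model.learn(total_timesteps=1_000_000)  # training budget is an assumption
```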