Inverse Reinforcement Learning with the Average Reward Criterion
Authors: Feiyang Wu, Jingyang Ke, Anqi Wu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we corroborate our analysis with numerical experiments using the MuJoCo benchmark and additional control tasks. Numerical experiments: our RL and IRL methodologies have been tested against the well-known robotics manipulation benchmark, MuJoCo, to substantiate our theoretical analysis. The results indicate that the proposed SPMD and IPMD algorithms generally outperform state-of-the-art algorithms. |
| Researcher Affiliation | Academia | Feiyang Wu (feiyangwu@gatech.edu), Jingyang Ke (jingyang.ke@gatech.edu), Anqi Wu (anqiwu@gatech.edu), School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332 |
| Pseudocode | Yes | Algorithm 1: The Stochastic Policy Mirror Descent (SPMD) algorithm for AMDPs; Algorithm 2: The Inverse Policy Mirror Descent (IPMD) algorithm (a generic mirror-descent update is sketched after this table). |
| Open Source Code | Yes | Our code can be found at https://anonymous.4open.science/r/IPMD-9D60. |
| Open Datasets | No | The paper uses the MuJoCo benchmark and environments (Hopper, HalfCheetah, Walker, Ant, Humanoid, Pendulum, LunarLanderContinuous). While these are standard, the paper does not provide concrete access information (link, DOI, formal citation) for the *datasets* used within these environments, nor specific details about how or where to obtain the exact data used for training and evaluation. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). It mentions training on environments but not data splitting. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments (e.g., specific GPU models, CPU models, or cloud computing instances with specifications). |
| Software Dependencies | No | The paper mentions "stable-baselines3 [28]" but does not provide specific version numbers for it or any other software dependencies. It also mentions "MuJoCo" and "OpenAI Gym" without versions. |
| Experiment Setup | Yes | During training, we found that setting the entropy coefficient term to 0.01 makes training stable and efficient. The learning rate is 3e-4. Each step of the algorithm samples 512 state-action sample pairs. A double Q-learning technique is used to minimize overestimation [13]. (A hedged configuration sketch follows the table.) |
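
The SPMD and IPMD pseudocode noted above both revolve around a mirror-descent policy update. As a point of reference only, below is a minimal tabular sketch of a generic KL-regularized policy mirror descent step; it is not the paper's exact SPMD/IPMD procedure (which targets average-reward MDPs with stochastic estimates), and the step size `eta` and the Q-table here are placeholders.

```python
# Minimal sketch of a KL-regularized policy mirror descent step (tabular case).
# Illustrative only: not the paper's exact SPMD/IPMD update.
import numpy as np

def pmd_step(policy: np.ndarray, q_values: np.ndarray, eta: float) -> np.ndarray:
    """One mirror-descent update: pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q_k(s, a)).

    policy:   (num_states, num_actions) row-stochastic matrix, current policy pi_k.
    q_values: (num_states, num_actions) estimated action values under pi_k.
    eta:      positive step size (placeholder value below).
    """
    logits = np.log(policy + 1e-12) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy

# Example: 3 states, 2 actions, uniform initial policy.
pi = np.full((3, 2), 0.5)
q = np.array([[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]])
pi = pmd_step(pi, q, eta=0.1)
```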
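
The experiment-setup details reported in the table (learning rate 3e-4, entropy coefficient 0.01, 512 sampled state-action pairs per step, double Q-learning) could plausibly be wired up as follows with stable-baselines3, which the paper cites. This is a hedged illustration, not the authors' released code: the choice of the SAC algorithm class, the `Hopper-v4` environment ID, and the training budget are assumptions, and the paper does not pin library versions.

```python
# Hedged reconstruction of the reported training configuration using
# stable-baselines3 SAC on a MuJoCo task. The authors' own code is linked in
# the repository above; algorithm class, env ID, and timesteps are assumptions.
import gymnasium as gym  # recent stable-baselines3 versions use Gymnasium;
                         # the paper only mentions OpenAI Gym without a version.
from stable_baselines3 import SAC

env = gym.make("Hopper-v4")  # one of the MuJoCo environments used in the paper

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,   # reported learning rate
    ent_coef=0.01,        # reported entropy coefficient
    batch_size=512,       # 512 state-action pairs sampled per step
    verbose=1,
)
# SAC's twin critics provide the double Q-learning mechanism the paper cites
# to reduce value overestimation.
model.learn(total_timesteps=1_000_000)  # training budget is an assumption
```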