Inverse Reinforcement Learning from a Gradient-based Learner
Authors: Giorgia Ramponi, Gianluca Drappo, Marcello Restelli
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate the approach in a simulated Gridworld environment and on the MuJoCo environments, comparing it with the state-of-the-art baseline. |
| Researcher Affiliation | Academia | Giorgia Ramponi, Politecnico di Milano, Milan, Italy, giorgia.ramponi@polimi.it; Gianluca Drappo, Politecnico di Milano, Milan, Italy, gianluca.drappo@mail.polimi.it; Marcello Restelli, Politecnico di Milano, Milan, Italy, marcello.restelli@polimi.it |
| Pseudocode | Yes | Algorithm 1 LOGEL (its core reward-recovery step is sketched below, after the table) |
| Open Source Code | No | The paper does not provide any specific links or statements regarding the release of the source code for the methodology described. |
| Open Datasets | No | The paper refers to using "MuJoCo environments" and a "Gridworld environment", which are common simulation platforms, and data is generated as "a dataset D = (D_1, ..., D_{m+1}) of trajectories generated by each policy". However, it does not provide concrete access information (e.g., specific links, DOIs, or citations to pre-existing, publicly available datasets) for the data used in the experiments. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test dataset splits with specific percentages or sample counts, nor does it reference predefined splits with citations. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Proximal Policy Optimization (PPO)" and the "MuJoCo control suite" but does not specify version numbers for these or other software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | The learner is trained using Proximal Policy Optimization (PPO) [31], with 16 parallel agents for each learning step. For each step, the length of the trajectories is 2000. The reward for the Reacher environment is a 26-point grid of radial basis functions describing the distance between the agent and the goal, plus the 2-norm squared of the action. In the Hopper environment, instead, the reward features are the distance between the previous and the current position and the 2-norm squared of the action. |
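
The Experiment Setup row describes the Reacher reward as a grid of 26 radial basis functions of the agent-goal distance plus the squared 2-norm of the action. The snippet below is a minimal sketch of such a feature map, assuming evenly spaced Gaussian centers; the grid range, bandwidth, and function names are illustrative choices, not values reported in the paper.

```python
# Hypothetical sketch of the Reacher reward features described in the table:
# 26 radial-basis activations of the fingertip-to-goal distance plus ||a||^2.
# Centers, bandwidth, and max_dist are assumptions, not the paper's values.
import numpy as np

def reacher_reward_features(distance_to_goal, action,
                            n_rbf=26, max_dist=0.5, bandwidth=0.05):
    centers = np.linspace(0.0, max_dist, n_rbf)        # assumed RBF grid
    rbf = np.exp(-((distance_to_goal - centers) ** 2) / (2.0 * bandwidth ** 2))
    action_cost = np.array([np.sum(np.asarray(action) ** 2)])  # ||a||^2 term
    return np.concatenate([rbf, action_cost])          # 27-dim feature vector

# With a learned weight vector omega, the linear reward would be
# omega @ reacher_reward_features(distance_to_goal, action).
```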
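The Pseudocode row points to Algorithm 1 (LOGEL). Under the paper's linear-reward assumption r(s, a) = ω⊤φ(s, a), the algorithm explains the learner's successive policy-parameter updates as gradient-ascent steps and recovers the reward weights (and per-step learning rates) from them. The sketch below illustrates only a least-squares recovery step of that kind; the Jacobian estimation, the behavioral-cloning stage, and all variable names are assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a LOGEL-style recovery step: observed
# updates are modeled as theta_{t+1} - theta_t ~= alpha_t * Gamma_t @ omega,
# where Gamma_t is the Jacobian of the expected feature counts at theta_t.
# The alternating scheme and iteration count are assumptions for illustration.
import numpy as np

def recover_reward_weights(delta_thetas, jacobians, n_iters=50):
    """delta_thetas: list of (d_theta,) arrays, theta_{t+1} - theta_t.
    jacobians:       list of (d_theta, d_features) arrays, estimated Gamma_t."""
    d_features = jacobians[0].shape[1]
    omega = np.ones(d_features) / d_features           # initial reward weights
    alphas = np.ones(len(delta_thetas))                # initial learning rates

    for _ in range(n_iters):
        # Fix the learning rates, solve one stacked least squares for omega.
        A = np.vstack([a * G for a, G in zip(alphas, jacobians)])
        b = np.concatenate(delta_thetas)
        omega, *_ = np.linalg.lstsq(A, b, rcond=None)

        # Fix omega, solve each scalar learning rate in closed form.
        for t, (dt, G) in enumerate(zip(delta_thetas, jacobians)):
            g = G @ omega
            denom = g @ g
            alphas[t] = (dt @ g) / denom if denom > 1e-12 else 0.0

    return omega, alphas
```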