Inverse Reinforcement Learning from a Gradient-based Learner

Authors: Giorgia Ramponi, Gianluca Drappo, Marcello Restelli

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we evaluate the approach in a simulated Gridworld environment and on the MuJoCo environments, comparing it with the state-of-the-art baseline.
Researcher Affiliation | Academia | Giorgia Ramponi (Politecnico di Milano, Milan, Italy, giorgia.ramponi@polimi.it); Gianluca Drappo (Politecnico di Milano, Milan, Italy, gianluca.drappo@mail.polimi.it); Marcello Restelli (Politecnico di Milano, Milan, Italy, marcello.restelli@polimi.it)
Pseudocode | Yes | Algorithm 1 LOGEL (see the sketch after this table)
Open Source Code | No | The paper does not provide any specific links or statements regarding the release of the source code for the methodology described.
Open Datasets | No | The paper refers to using "MuJoCo environments" and a "Gridworld environment", which are common simulation platforms, and data is generated as "a dataset D = (D1, ..., Dm+1) of trajectories generated by each policy". However, it does not provide concrete access information (e.g., specific links, DOIs, or citations to pre-existing, publicly available datasets) for the data used in the experiments.
Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test dataset splits with specific percentages or sample counts, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using "Proximal Policy Optimization (PPO)" and the "MuJoCo control suite" but does not specify version numbers for these or other software dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | The learner is trained using Proximal Policy Optimization (PPO) [31], with 16 parallel agents for each learning step. For each step, the length of the trajectories is 2000. The reward for the Reacher environment is a 26-grid radial basis function that describes the distance between the agent and the goal, plus the 2-norm squared of the action. In the Hopper environment, instead, the reward features are the distance between the previous and the current position and the 2-norm squared of the action. (See the feature-map sketch after this table.)
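
The Pseudocode row above points to Algorithm 1 (LOGEL), which models the observed learner as performing gradient ascent on a reward that is linear in known features and recovers the reward weights from the learner's successive policy parameters. The snippet below is a minimal, non-authoritative sketch of that reward-recovery step only: it assumes the policy parameters theta_t have already been estimated from the demonstration datasets (e.g., by behavioral cloning) and that the learning rate alpha is known; all variable names and the single fixed learning rate are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def recover_reward_weights(thetas, jacobians, alpha):
    """Sketch of LOGEL-style reward recovery from a gradient-based learner.

    thetas:    sequence [theta_0, ..., theta_m] of the learner's policy
               parameters along its improvement path (each of size d).
    jacobians: sequence [Gamma_0, ..., Gamma_{m-1}], where Gamma_t is the
               d x q Jacobian of the expected cumulative reward features
               psi(theta) at theta_t (column i = gradient of feature i).
    alpha:     assumed, known learning rate of the learner (fixed here
               only to keep the sketch simple).

    Gradient-ascent model of the learner on a linear reward w:
        theta_{t+1} ~= theta_t + alpha * Gamma_t @ w,
    so w is found by stacking every observed update into a single
    linear least-squares problem.
    """
    A = np.vstack([alpha * G for G in jacobians])            # (m*d, q)
    b = np.concatenate([thetas[t + 1] - thetas[t]
                        for t in range(len(thetas) - 1)])    # (m*d,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    # The reward scale is not identifiable, so normalize (illustrative choice).
    return w / np.linalg.norm(w, 1)
```

In the paper, the policy parameters are themselves recovered from the trajectory datasets D1, ..., Dm+1, and LOGEL also estimates the learner's learning rates rather than assuming them known; the sketch takes both as given to keep the least-squares structure visible.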
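The Experiment Setup row describes the Reacher reward features as a 26-element radial-basis-function encoding related to the agent-goal distance plus the squared 2-norm of the action. The snippet below is one possible reading of that description, sketched under assumed values for the RBF centers and bandwidth; the function name and all numeric defaults are illustrative, not the paper's configuration.

```python
import numpy as np

def reacher_reward_features(agent_pos, goal_pos, action,
                            n_centers=26, bandwidth=0.1, max_dist=1.0):
    """Illustrative reward-feature map for a Reacher-like task.

    Encodes the agent-goal distance with n_centers Gaussian radial basis
    functions and appends the squared L2 norm of the action; a linear
    reward is then r(s, a) = w @ reacher_reward_features(...).
    n_centers, bandwidth and max_dist are assumed values for this sketch.
    """
    dist = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos))
    centers = np.linspace(0.0, max_dist, n_centers)
    rbf = np.exp(-((dist - centers) ** 2) / (2.0 * bandwidth ** 2))
    action_cost = np.sum(np.square(action))
    return np.concatenate([rbf, [action_cost]])
```

For Hopper, the row suggests only two features (the displacement between the previous and current position and the squared action norm), which would follow the same pattern with the RBF block replaced by that displacement.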