Provably Efficient Learning of Transferable Rewards

Authors: Alberto Maria Metelli, Giorgia Ramponi, Alessandro Concetti, Marcello Restelli

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we provide numerical simulations on benchmark environments.
Researcher Affiliation | Academia | Alberto Maria Metelli*¹, Giorgia Ramponi*¹, Alessandro Concetti¹, Marcello Restelli¹ (¹Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy). Correspondence to: Alberto Maria Metelli <albertomaria.metelli@polimi.it>.
Pseudocode | Yes | Algorithm 1 Uniform Sampling IRL [...] Algorithm 2 TRAVEL
Open Source Code | No | The paper does not provide any statements about the availability of open-source code for the described methodology or links to code repositories.
Open Datasets | No | The paper describes using a "3×3 Gridworld environment" and "200 random generated MDPs" for experiments. However, it does not provide concrete access information (e.g., links, DOIs, formal citations with authors/year) for these environments or any other publicly available datasets used for training.
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., exact percentages, sample counts, or citations to predefined splits) for training, validation, or testing data. It mentions "source MDP" and "target MDPs", but these refer to different environments rather than data splits of a single dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It does not mention any computing environments or machines beyond general descriptions of experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with versions) needed to replicate the experiments.
Experiment Setup | Yes | In Section 8, 'Experimental Evaluation', the paper describes the environments used: 'a 3×3 Gridworld environment with an obstacle in the central cell that makes the agent bounce back with probability p and surpass it with probability 1 − p. [...] The source MDP has obstacle's probability p = 0.8 and target MDPs are four Gridworlds with obstacle's probabilities p ∈ {0, 0.2, 0.5, 0.8}.' It also states the performance index: 'In all the experiments, we employ reward functions that depend on the state only and the algorithms are evaluated according to the following performance index ‖V*_{M1 ∪ r_E} − V^{π̂}_{M1 ∪ r_E}‖_2^2.' Additionally, it mentions a 'chain MDP composed of 6 states S = {s0, ..., s4, sb} and 10 actions A = {ag, a1, ..., a9}'.
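
The setup quoted in the last row is concrete enough to reconstruct a toy version of the Gridworld experiment. Below is a minimal sketch, assuming 4-connected movement, a discount factor of 0.9, a goal reward in one corner, and that 'surpassing' the obstacle simply means entering the central cell; these modeling choices and all function names are illustrative assumptions, not the authors' implementation (which, per the table above, is not publicly released). The sketch builds the transition kernel for a given obstacle probability p and evaluates the performance index ‖V* − V^π̂‖_2^2 for a candidate policy.

```python
# Hedged sketch of the 3x3 Gridworld with a central obstacle described in the
# paper's Section 8. Dynamics details (4-connected moves, gamma = 0.9, the
# goal-reward placement, and modeling "surpass" as entering the centre cell)
# are assumptions, not taken from the authors' code, which is not released.
import numpy as np

N, GAMMA, OBSTACLE = 9, 0.9, 4                 # 3x3 grid; centre cell index 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def build_P(p):
    """Transition kernel P[a, s, s'] with bounce-back probability p."""
    P = np.zeros((len(ACTIONS), N, N))
    for a, (dr, dc) in enumerate(ACTIONS):
        for s in range(N):
            r, c = divmod(s, 3)
            nr, nc = r + dr, c + dc
            s2 = nr * 3 + nc if 0 <= nr < 3 and 0 <= nc < 3 else s
            if s2 == OBSTACLE:                 # bounce back w.p. p,
                P[a, s, s] += p                # "surpass" w.p. 1 - p
                P[a, s, s2] += 1.0 - p         # (modeled as entering the cell)
            else:
                P[a, s, s2] += 1.0
    return P

def value_iteration(P, R, tol=1e-10):
    """Optimal state values V* for a state-only reward vector R."""
    V = np.zeros(N)
    while True:
        Q = R + GAMMA * P @ V                  # shape (|A|, N)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_evaluation(P, R, pi):
    """Exact V^pi by solving (I - gamma * P_pi) V = R."""
    P_pi = np.array([P[pi[s], s] for s in range(N)])
    return np.linalg.solve(np.eye(N) - GAMMA * P_pi, R)

R = np.zeros(N); R[8] = 1.0                    # assumed goal reward in a corner
P = build_P(p=0.2)                             # one of the target probabilities
V_star = value_iteration(P, R)
pi_hat = np.zeros(N, dtype=int)                # some candidate policy ("up")
index = np.sum((V_star - policy_evaluation(P, R, pi_hat)) ** 2)
print(f"performance index ||V* - V^pi||_2^2 = {index:.4f}")
```

Running the same evaluation for p ∈ {0, 0.2, 0.5, 0.8} mirrors the source/target sweep quoted above; the chain MDP is not reconstructed here because its transition dynamics are not specified in this excerpt.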