Programmatic Reward Design by Example

Authors: Weichao Zhou, Wenchao Li

AAAI 2022, pp. 9233-9241

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that programmatic reward functions learned using this framework can significantly outperform those learned using existing reward learning algorithms, and enable RL agents to achieve state-of-the-art performance on highly complex tasks.
Researcher Affiliation | Academia | Weichao Zhou and Wenchao Li, Boston University, {zwc662,wenchao}@bu.edu
Pseudocode | Yes | Algorithm 1: Generative Adversarial PRDBE (a hedged sketch of the adversarial pattern this name implies appears after the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. It only references a third-party environment: https://github.com/maximecb/gym-minigrid.
Open Datasets | Yes | We select from the MiniGrid environments (Chevalier-Boisvert, Willems, and Pal 2018) three challenging RL tasks... Chevalier-Boisvert, M.; Willems, L.; and Pal, S. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid. Accessed: 2022-04-25. (A loading sketch for these environments appears after the table.)
Dataset Splits | No | The paper describes using a certain number of 'example trajectories' (e.g., '10 example trajectories demonstrated in a Door Key-8x8 environment') for learning, but does not specify explicit train/validation/test splits with percentages or counts that would reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions several algorithms and environments (e.g., PPO, AGAC, MiniGrid) but does not provide specific version numbers for any ancillary software dependencies or libraries (e.g., Python, PyTorch, Gym versions) needed for reproduction.
Experiment Setup | No | The paper does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, number of epochs, optimizer settings) or detailed training configurations in the main text.
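Since the paper's Algorithm 1 (Generative Adversarial PRDBE) is only named above and not reproduced, the following is a minimal, hedged sketch of the generic adversarial reward-learning pattern that the name implies: a reward ("discriminator") is pushed to score demonstrated state-action pairs above those visited by the current policy, while the policy ("generator") is updated to maximize the learned reward. The toy MDP, parameterization, and update rules are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Minimal sketch of a GAIL-style adversarial reward-learning loop.
# NOTE: this is NOT the paper's Algorithm 1; the toy setting and
# update rules below are illustrative assumptions only.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2

# (state, action) pairs standing in for the paper's example trajectories.
demos = [(s, s % n_actions) for s in range(n_states)]

theta = np.zeros((n_states, n_actions))          # learned reward ("discriminator")
policy_logits = np.zeros((n_states, n_actions))  # policy ("generator")

def sample_policy_pairs(k=32):
    """Sample (state, action) pairs from the current softmax policy."""
    states = rng.integers(n_states, size=k)
    probs = np.exp(policy_logits[states])
    probs /= probs.sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(n_actions, p=p) for p in probs])
    return list(zip(states, actions))

for _ in range(200):
    # Reward step: raise the learned reward on demonstrated pairs and
    # lower it on pairs visited by the current policy.
    for s, a in demos:
        theta[s, a] += 0.1
    for s, a in sample_policy_pairs():
        theta[s, a] -= 0.1 * len(demos) / 32
    # Policy step: a crude gradient-like move toward high-reward actions.
    policy_logits += 0.05 * (theta - theta.mean(axis=1, keepdims=True))

print("greedy action per state:", policy_logits.argmax(axis=1))
print("demonstrated action per state:", [a for _, a in demos])
```

After a few hundred alternations the greedy policy recovers the demonstrated action in each state, which is the qualitative behavior any adversarial reward-learning scheme of this shape aims for.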
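For the gym-minigrid dependency cited above, importing the package registers its environment IDs with Gym. Below is a minimal loading sketch for the DoorKey-8x8 task named in the paper, assuming the legacy Gym reset/step API that 2018-era gym-minigrid targeted (exact IDs and API may differ across versions):

```python
import gym
import gym_minigrid  # importing registers the MiniGrid-* environment IDs with Gym

# One of the MiniGrid tasks the paper draws on; the ID follows
# gym-minigrid's registration scheme.
env = gym.make('MiniGrid-DoorKey-8x8-v0')

obs = env.reset()  # legacy Gym API: reset() returns only the observation
done = False
while not done:
    action = env.action_space.sample()          # random policy, for illustration
    obs, reward, done, info = env.step(action)  # legacy 4-tuple step API
env.close()
```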