Discovering Generalizable Spatial Goal Representations via Graph-based Active Reward Learning

Authors: Aviv Netanyahu, Tianmin Shu, Joshua Tenenbaum, Pulkit Agrawal

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted experiments with simulated oracles and with human subjects."
Researcher Affiliation | Academia | "(1) Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA; (2) Dept. of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA."
Pseudocode | Yes | "Algorithm 1: Active Reward Refinement"
Open Source Code | Yes | "Project website: https://www.tshu.io/GEM"
Open Datasets | No | "We propose a one-shot imitation learning environment, Watch&Move. ... We design 9 object rearrangement tasks in the Watch&Move environment... Expert demonstrations were created with a planner introduced in (Netanyahu et al., 2021), with a length ranging from 8 to 35 steps." The paper does not provide concrete access information for the Watch&Move tasks and demonstrations (no link, DOI, repository, or formal citation for the dataset itself).
Dataset Splits | No | The paper describes "training sets" (S_D, S+, S-) used in its active reward refinement, but these are collected dynamically during learning rather than being conventional static splits (e.g., percentages or counts) of a predefined dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | "We use PyBox2D to simulate the physical dynamics in the environment. ... For optimizing the network, we use Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0003. ... We build upon an AIRL implementation (Wang et al., 2020)..." The paper names software tools (PyBox2D, the Adam optimizer, an AIRL implementation) but does not specify their version numbers.
Experiment Setup | Yes | "M-AIRL is executed for 500k generator steps; the expert batch size is the length of the expert demonstrations. For the model-based policy, we set β = 0.3 in Eq. (4). The discriminator is updated for 4 steps after every model-based generator execution. ... We apply 5k network updates per query iteration. For optimizing the network, we use Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0003. For each update, we sample a batch of 16 states for the regression loss and a batch of 16 pairs of positive and negative states for the reward ranking loss. ... we set λ = 0.2 in Eq. (8)." These reported values are collected in the configuration sketch below.
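
For reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch assuming a PyTorch-style setup; the field names, the toy reward network, and the optimizer wiring are illustrative assumptions, not the authors' released code. Only the numeric values come from the text above.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class GEMTrainingConfig:
    """Hyperparameters reported in the paper's experiment setup.

    Field names are illustrative; only the values are taken from the paper.
    """
    generator_steps: int = 500_000       # M-AIRL generator steps
    beta: float = 0.3                    # model-based policy weight, Eq. (4)
    discriminator_steps_per_gen: int = 4  # discriminator updates per generator execution
    updates_per_query: int = 5_000       # network updates per query iteration
    learning_rate: float = 3e-4          # Adam (Kingma & Ba, 2014)
    regression_batch_size: int = 16      # states sampled for the regression loss
    ranking_batch_size: int = 16         # positive/negative pairs for the reward ranking loss
    lam: float = 0.2                     # loss weighting, Eq. (8)


config = GEMTrainingConfig()

# Hypothetical stand-in for the learned reward network; the paper's actual
# architecture is not reproduced here.
reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=config.learning_rate)
```

Note that the expert batch size is not a fixed constant in this sketch, since the paper states it equals the length of the expert demonstrations (8 to 35 steps per task).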