Discovering Generalizable Spatial Goal Representations via Graph-based Active Reward Learning
Authors: Aviv Netanyahu, Tianmin Shu, Joshua Tenenbaum, Pulkit Agrawal
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments with simulated oracles and with human subjects. |
| Researcher Affiliation | Academia | 1 Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 2 Dept. of Brain and Cognitive Science, Massachusetts Institute of Technology, Cambridge, MA. |
| Pseudocode | Yes | Algorithm 1 Active Reward Refinement |
| Open Source Code | Yes | Project website: https://www.tshu.io/GEM |
| Open Datasets | No | We propose a one-shot imitation learning environment, Watch&Move. ... We design 9 object rearrangement tasks in the Watch&Move environment... Expert demonstrations were created with a planner introduced in (Netanyahu et al., 2021), with a length ranging from 8 to 35 steps. The paper does not provide concrete access information (link, DOI, repository, or formal citation for the dataset itself) for the Watch&Move tasks/demonstrations. |
| Dataset Splits | No | The paper describes "training sets" (S_D, S+, S-) used in its active reward refinement, but these are collected dynamically during the learning process and are not conventional static dataset splits (e.g., percentages or counts) of a predefined dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | We use PyBox2D to simulate the physical dynamics in the environment. ... For optimizing the network, we use Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0003. ... We build upon an AIRL implementation (Wang et al., 2020)... The paper mentions software tools like PyBox2D, the Adam optimizer, and an AIRL implementation but does not specify their version numbers. |
| Experiment Setup | Yes | M-AIRL is executed for 500k generator steps; the expert batch size is the length of the expert demonstrations. For the model-based policy, we set β = 0.3 in Eq. (4). The discriminator is updated for 4 steps after every model-based generator execution. ... We apply 5k network updates per query iteration. For optimizing the network, we use Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0003. For each update, we sample a batch of 16 states for the regression loss and a batch of 16 pairs of positive and negative states for the reward ranking loss. ... we set λ = 0.2 in Eq. (8). |
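
The Experiment Setup row above describes one reward-refinement update: Adam with learning rate 0.0003, a batch of 16 states for the regression loss, a batch of 16 positive/negative state pairs for the reward ranking loss, and λ = 0.2 weighting the ranking term. Below is a minimal, hedged sketch of such an update. The small MLP, the state dimensionality, the MSE form of the regression loss, the logistic form of the ranking loss, and the placeholder batches are all assumptions for illustration; the paper's actual model is graph-based and its exact loss terms (Eq. 8) are not reproduced here.

```python
# Hedged sketch of one reward-refinement update, per the Experiment Setup row.
# Placeholders: MLP reward model, state dimension, loss forms, and random data.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM = 32   # assumed placeholder; true state dimensionality not given here
BATCH = 16       # batch size quoted in the setup
LAMBDA = 0.2     # weight on the ranking term (λ in Eq. 8)

reward_net = nn.Sequential(          # stand-in for the paper's graph-based reward model
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=3e-4)

def refinement_update(states, targets, pos_states, neg_states):
    """One of the 5k network updates applied per query iteration (sketch)."""
    # Regression term: fit predicted rewards to assumed scalar targets.
    pred = reward_net(states).squeeze(-1)
    regression_loss = F.mse_loss(pred, targets)

    # Ranking term: positive states should score higher than negative states;
    # a logistic pairwise loss is used here as a stand-in for the paper's loss.
    r_pos = reward_net(pos_states).squeeze(-1)
    r_neg = reward_net(neg_states).squeeze(-1)
    ranking_loss = F.softplus(r_neg - r_pos).mean()

    loss = regression_loss + LAMBDA * ranking_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random placeholder batches standing in for samples from S_D, S+, and S-.
states = torch.randn(BATCH, STATE_DIM)
targets = torch.randn(BATCH)
pos_states = torch.randn(BATCH, STATE_DIM)
neg_states = torch.randn(BATCH, STATE_DIM)
print(refinement_update(states, targets, pos_states, neg_states))
```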