Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
Authors: Vihang Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M Blies, Johannes Brandstetter, José Arjona-Medina, Sepp Hochreiter
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Align-RUDDER is compared on three artificial tasks with sparse & delayed rewards and few demonstrations to Behavioral Cloning with Q-learning (BC+Q), Soft Q Imitation Learning (SQIL) (Reddy et al., 2020), RUDDER (LSTM), and Deep Q-learning from Demonstrations (DQfD) (Hester et al., 2018). Then, we test Align-RUDDER on the complex Minecraft ObtainDiamond task with episodic rewards (Guss et al., 2019b). All experiments use finite time MDPs with gamma = 1 and episodic reward. |
| Researcher Affiliation | Collaboration | (1) ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz; (2) Dynatrace Research; (3) enlite AI; (4) Now at Microsoft Research; (5) Institute of Advanced Research in Artificial Intelligence. |
| Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or algorithm blocks with formal labels or formatting. |
| Open Source Code | Yes | Code is available at github.com/ml-jku/align-rudder. |
| Open Datasets | Yes | We use a 1D key-chest environment to show the effectiveness of sequence alignment in a low data regime compared to an LSTM model. Artificial tasks (I) and (II) are variations of the gridworld rooms example (Sutton et al., 1999), where cells are the MDP states. Then, we test Align-RUDDER on the complex Minecraft ObtainDiamond task with episodic rewards (Guss et al., 2019b). |
| Dataset Splits | No | The paper describes hyperparameter selection and training/testing, but does not explicitly provide specific train/validation/test dataset splits (percentages or counts) in a reproducible manner. |
| Hardware Specification | Yes | Artificial task (I) and (II) experiments were performed using CPU only as GPU speed-up was negligible. The final results for all methods were created on an internal CPU cluster with 128 CPU cores with a measured wall-clock time of 10,360 hours. ... For Minecraft, during development 6 to 8 nodes each with 4 GPUs of an internal GPU cluster were used for roughly six months of GPU compute time (Nvidia Titan V and 2080 TI). |
| Software Dependencies | Yes | We are thankful towards the developers of Mazelab (Zuo, 2018), PyTorch (Paszke et al., 2019), OpenAI Gym (Brockman et al., 2016), NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007) and Minecraft (Guss et al., 2019b). |
| Experiment Setup | Yes | For (BC)+Q-Learning and Align-RUDDER, we performed a grid search to select the learning rate from the following values [0.1, 0.05, 0.01]. ... For DQfD, we set the experience buffer size to 30,000 and the number of experiences sampled at every timestep to 10. The DQfD loss weights are set to 0.01, 0.01 and 1.0 for the Q-learning loss term, n-step loss term and the expert loss respectively during pre-training. ... For successor representation, we use a learning rate of 0.1 and a gamma of 0.99. ... For affinity propagation, we use a damping factor of 0.5 and set the maximum number of iterations to 1000. (A configuration sketch using these values appears below the table.) |
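To make the quoted hyperparameters concrete, the sketch below shows one way the successor-representation update and the affinity-propagation clustering could be configured with the stated values. It is a minimal illustration, not the authors' released code: the tabular SR estimator is an assumption about how the quoted learning rate and gamma are used, and the helper names `successor_representation` and `cluster_states` are hypothetical. Only scikit-learn's `AffinityPropagation` (with its `damping` and `max_iter` parameters) is a real library API.

```python
# Hedged sketch of the hyperparameter settings quoted above.
# Assumes a tabular successor-representation (SR) estimator and
# scikit-learn's AffinityPropagation; helper names are illustrative.
import numpy as np
from sklearn.cluster import AffinityPropagation

# Learning-rate grid reported for (BC)+Q-Learning and Align-RUDDER.
LEARNING_RATE_GRID = [0.1, 0.05, 0.01]

# Successor-representation hyperparameters quoted in the paper.
SR_LEARNING_RATE = 0.1
SR_GAMMA = 0.99


def successor_representation(transitions, n_states,
                             lr=SR_LEARNING_RATE, gamma=SR_GAMMA):
    """Standard TD-style tabular SR update over (state, next_state) pairs."""
    sr = np.zeros((n_states, n_states))
    for s, s_next in transitions:
        # Target: one-hot occupancy of s plus discounted SR of the successor.
        target = np.eye(n_states)[s] + gamma * sr[s_next]
        sr[s] += lr * (target - sr[s])
    return sr


def cluster_states(state_features):
    """Affinity propagation with the damping factor and iteration limit
    stated in the paper (both are real scikit-learn parameters)."""
    ap = AffinityPropagation(damping=0.5, max_iter=1000, random_state=0)
    return ap.fit_predict(state_features)
```

Under these assumptions, the SR rows (or the raw state features) would be clustered with `cluster_states`, and the learning rates in `LEARNING_RATE_GRID` would be swept independently for each method during the reported grid search.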