Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

Authors: Vihang Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M Blies, Johannes Brandstetter, José Arjona-Medina, Sepp Hochreiter

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Align-RUDDER is compared on three artificial tasks with sparse & delayed rewards and few demonstrations to Behavioral Cloning with Q-learning (BC+Q), Soft Q Imitation Learning (SQIL) (Reddy et al., 2020), RUDDER (LSTM), and Deep Q-learning from Demonstrations (DQfD) (Hester et al., 2018). Then, we test Align-RUDDER on the complex Minecraft ObtainDiamond task with episodic rewards (Guss et al., 2019b). All experiments use finite time MDPs with gamma = 1 and episodic reward. (See the episodic-reward sketch below the table.)
Researcher Affiliation | Collaboration | (1) ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz; (2) Dynatrace Research; (3) enlite AI; (4) Now at Microsoft Research; (5) Institute of Advanced Research in Artificial Intelligence.
Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or algorithm blocks with formal labels or formatting.
Open Source Code | Yes | Code is available at github.com/ml-jku/align-rudder.
Open Datasets | Yes | We use a 1D key-chest environment to show the effectiveness of sequence alignment in a low data regime compared to an LSTM model. ... Artificial tasks (I) and (II). They are variations of the gridworld rooms example (Sutton et al., 1999), where cells are the MDP states. ... Then, we test Align-RUDDER on the complex Minecraft ObtainDiamond task with episodic rewards (Guss et al., 2019b).
Dataset Splits | No | The paper describes hyperparameter selection and training/testing, but does not explicitly provide specific train/validation/test dataset splits (percentages or counts) in a reproducible manner.
Hardware Specification | Yes | Artificial task (I) and (II) experiments were performed using CPU only as GPU speed-up was negligible. The final results for all methods were created on an internal CPU cluster with 128 CPU cores with a measured wall-clock time of 10,360 hours. ... For Minecraft, during development 6 to 8 nodes each with 4 GPUs of an internal GPU cluster were used for roughly six months of GPU compute time (Nvidia Titan V and 2080 TI).
Software Dependencies | Yes | We are thankful towards the developers of Mazelab (Zuo, 2018), PyTorch (Paszke et al., 2019), OpenAI Gym (Brockman et al., 2016), NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007) and Minecraft (Guss et al., 2019b).
Experiment Setup | Yes | For (BC)+Q-Learning and Align-RUDDER, we performed a grid search to select the learning rate from the following values [0.1, 0.05, 0.01]. ... For DQfD, we set the experience buffer size to 30,000 and the number of experiences sampled at every timestep to 10. The DQfD loss weights are set to 0.01, 0.01 and 1.0 for the Q-learning loss term, n-step loss term and the expert loss respectively during pre-training. ... For successor representation, we use a learning rate of 0.1 and a gamma of 0.99. ... For affinity propagation, we use a damping factor of 0.5 and set the maximum number of iterations to 1000.
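
For the Experiment Setup row: the quoted successor-representation and affinity-propagation hyperparameters belong to the step where Align-RUDDER clusters states before aligning demonstrations. The sketch below shows one way the quoted values (SR learning rate 0.1, gamma 0.99, affinity-propagation damping 0.5 with at most 1000 iterations, and the learning-rate grid [0.1, 0.05, 0.01]) could be wired together, assuming scikit-learn's AffinityPropagation; the trajectory format, state count, and function names are illustrative and not the authors' implementation.

```python
# Minimal sketch of the clustering-related hyperparameters quoted above
# (assumptions: tabular states, cosine similarity between successor
# representations, scikit-learn's AffinityPropagation; all names illustrative).
import numpy as np
from sklearn.cluster import AffinityPropagation

N_STATES = 100                 # hypothetical number of tabular states
SR_LR, SR_GAMMA = 0.1, 0.99    # successor-representation values quoted above

def successor_representation(trajectories):
    """TD(0) estimate of the successor representation from demo state sequences."""
    psi = np.zeros((N_STATES, N_STATES))
    for traj in trajectories:                      # traj: list of state indices
        for s, s_next in zip(traj[:-1], traj[1:]):
            target = np.eye(N_STATES)[s] + SR_GAMMA * psi[s_next]
            psi[s] += SR_LR * (target - psi[s])
    return psi

def cluster_states(psi):
    """Cluster states by the similarity of their successor representations."""
    unit = psi / (np.linalg.norm(psi, axis=1, keepdims=True) + 1e-8)
    similarity = unit @ unit.T                     # cosine similarity matrix
    ap = AffinityPropagation(damping=0.5, max_iter=1000,
                             affinity="precomputed", random_state=0)
    return ap.fit_predict(similarity)              # one cluster label per state

# Grid search over the learning rates quoted above (training call is a placeholder).
for lr in (0.1, 0.05, 0.01):
    pass  # e.g. train BC+Q or Align-RUDDER's Q-learning with this learning rate
```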
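
For the Research Type row above: "episodic reward" means the return is paid out only when a finite-horizon episode ends (with gamma = 1), which is what makes credit assignment hard in these tasks. Below is a minimal sketch of such a setting, assuming the classic OpenAI Gym step API that returns a 4-tuple; the wrapper name is hypothetical and not part of the paper's code.

```python
# Minimal sketch of an episodic-reward setting (assumption: classic OpenAI Gym
# API returning (obs, reward, done, info); the wrapper name is hypothetical).
import gym

class EpisodicRewardWrapper(gym.Wrapper):
    """Withhold per-step reward and pay out the accumulated return only
    when the episode terminates (sparse, delayed, episodic reward)."""

    def reset(self, **kwargs):
        self._episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._episode_return += reward
        # Zero reward at every intermediate step; the full return at the end.
        reward = self._episode_return if done else 0.0
        return obs, reward, done, info
```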