Augmenting Policy Learning with Routines Discovered from a Single Demonstration

Authors: Zelin Zhao, Chuang Gan, Jiajun Wu, Xiaoxiao Guo, Joshua B. Tenenbaum (pp. 11024-11032)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on Atari games demonstrate that RAPL improves the state-of-the-art imitation learning method SQIL and the reinforcement learning method A2C.
Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, 2 MIT-IBM Watson AI Lab, 3 Stanford University, 4 Massachusetts Institute of Technology
Pseudocode | Yes | The pseudocode of routine discovery is provided in the supplementary material.
Open Source Code | Yes | Our code is now available at https://github.com/sjtuytc/AAAI21RoutineAugmentedPolicyLearning.
Open Datasets | Yes | Our experiments are conducted on the Atari benchmark (Bellemare et al. 2012) and CoinRun (Cobbe et al. 2018).
Dataset Splits | Yes | We train two agents by both A2C and RAPL-A2C on the same 100 easy levels. Then we test them on 100 unseen easy levels to test the generalization ability to unseen levels. After that, we test both agents on 100 hard levels to test the generalization ability across difficulties.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software like Gym, Adam, and RMSProp, but does not provide specific version numbers for these or other relevant software libraries.
Experiment Setup | Yes | We use a frame-skip of 4, a frame-stack of 4, and the minimal action space (Bellemare et al. 2012). [...] We use λ_value = 0.5 and λ_entropy = 0.01 to balance the value loss and entropy loss accordingly. We set λ_prim = 1.0 when using routine augmentation. The optimizer is RMSProp with a learning rate of 7 × 10⁻⁴ and a linear decay of 10⁻⁵ per timestep. We use entropy regularization with β = 0.02. The return is calculated for N = 5 steps. Each agent is trained for 10 million steps. [...] In all experiments, we set the balancing factor between frequency and length to λ_length = 0.1. Moreover, the number of routines is set to K = 3. We keep only the best routine among routines whose Levenshtein distance is smaller than α = 2.
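
The Dataset Splits row describes training on 100 easy CoinRun levels, then evaluating on 100 unseen easy levels and 100 hard levels. Below is a minimal sketch of how such a split can be constructed. It assumes the later Procgen port of CoinRun (the `procgen:procgen-coinrun-v0` env id and its `num_levels`, `start_level`, and `distribution_mode` arguments), not the authors' original 2018 CoinRun setup, and the particular seed offsets are illustrative only.

```python
import gym  # requires the procgen package to be installed

# Training set: a fixed pool of 100 easy levels.
train_env = gym.make("procgen:procgen-coinrun-v0",
                     num_levels=100, start_level=0, distribution_mode="easy")

# Generalization to unseen levels: 100 easy levels disjoint from training.
unseen_easy_env = gym.make("procgen:procgen-coinrun-v0",
                           num_levels=100, start_level=100, distribution_mode="easy")

# Generalization across difficulties: 100 hard levels.
hard_env = gym.make("procgen:procgen-coinrun-v0",
                    num_levels=100, start_level=0, distribution_mode="hard")
```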
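The Experiment Setup row also lists the routine-discovery hyperparameters (λ_length = 0.1, K = 3, α = 2). The sketch below shows one plausible way those values could be used to pick routines from a single demonstration's action sequence; the additive frequency-plus-length score and the contiguous-subsequence enumeration are assumptions, since the paper's actual discovery procedure is given in its supplementary material.

```python
from collections import Counter

# Hyperparameters quoted from the paper's experiment setup.
LAMBDA_LENGTH = 0.1   # balancing factor between frequency and length
K = 3                 # number of routines to keep
ALPHA = 2             # Levenshtein threshold for near-duplicate routines


def levenshtein(a, b):
    """Edit distance between two action sequences (standard DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]


def discover_routines(demo_actions, min_len=2, max_len=6):
    """Pick K routines from one demonstration's action sequence.

    Scores every contiguous subsequence by frequency + LAMBDA_LENGTH * length
    (this additive form is an assumption), then greedily keeps high-scoring
    routines while discarding any whose edit distance to an already kept
    routine is below ALPHA.
    """
    counts = Counter(
        tuple(demo_actions[i:i + length])
        for length in range(min_len, max_len + 1)
        for i in range(len(demo_actions) - length + 1)
    )
    ranked = sorted(counts, key=lambda r: counts[r] + LAMBDA_LENGTH * len(r),
                    reverse=True)
    kept = []
    for routine in ranked:
        if all(levenshtein(routine, other) >= ALPHA for other in kept):
            kept.append(routine)
        if len(kept) == K:
            break
    return kept


# Toy demonstration over a discrete action space: the repeated (0, 1, 1, 2)
# pattern should surface as the top routine.
print(discover_routines([0, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 3]))
```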