PAE: Reinforcement Learning from External Knowledge for Efficient Exploration

Authors: Zhe Wu, Haofei Lu, Junliang Xing, You Wu, Renye Yan, Yaozhong Gan, Yuanchun Shi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments across 11 challenging tasks from the BabyAI and MiniHack environment suites demonstrate PAE's superior exploration efficiency with good interpretability.
Researcher Affiliation | Academia | Zhe Wu (1), Haofei Lu (2), Junliang Xing (2), You Wu (1,3), Renye Yan (1,4), Yaozhong Gan (1), Yuanchun Shi (1,2); (1) Qi Yuan Lab, (2) Department of Computer Science and Technology, Tsinghua University, (3) Nanjing University, (4) Peking University
Pseudocode | Yes | Algorithm 1 in Appendix A.4.1 summarizes our PAE procedure.
Open Source Code | No | The paper states 'We implemented the PAE and reproduction baseline algorithms using TorchBeast (Küttler et al., 2019)' and provides a footnote link to the TorchBeast GitHub repository (https://github.com/facebookresearch/torchbeast). However, this links to the third-party framework used, not to the authors' own PAE implementation.
Open Datasets | Yes | We evaluated our method across two task types, totaling six environments within the BabyAI suite: Key Corridor tasks (KEYCORRS3R3, KEYCORRS4R3, KEYCORRS5R3) and Obstructed Maze tasks (OBSTRMAZE1DL, OBSTRMAZE2DLHB, OBSTRMAZE1Q). ...we extended our PAE approach to the more challenging MiniHack tasks. MiniHack consists of procedurally generated tasks... Our comparisons of PAE with baseline methods encompassed five MiniHack environments: LAVACROSS-RING, LAVACROSS-POTION, LAVACROSS-FULL, RIVER-MONSTER, and MULTIROOM-N4-MONSTER. See Appendix A.2 for more details.
Dataset Splits | No | The paper does not provide percentages or counts for training, validation, and test splits. It describes the environments used for evaluation but not any data partitioning methodology.
Hardware Specification | Yes | Each model was trained using five independent seeds on a system with 112 Intel Xeon Platinum 8280 cores and 6 NVIDIA RTX 3090 GPUs.
Software Dependencies | Yes | We implemented the PAE and reproduction baseline algorithms using TorchBeast (Küttler et al., 2019)... Meanwhile, the Planner utilized a pre-trained BERT model with frozen parameters to understand the semantics and encode knowledge.
Experiment Setup | Yes | For PAE, we ran a grid search over batch size {8, 32, 150}, unroll length {20, 40, 100, 200}, entropy cost for the Actor {0.0001, 0.0005, 0.001}, the Actor learning rate {0.0001, 0.0005, 0.001}, the Planner learning rate {0.0001, 0.0005, 0.001}, and entropy cost for the Planner {0.001, 0.005, 0.01}. Table 5 shows the best parameters obtained from the search.
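The reported search space is a plain Cartesian product of six hyperparameter sets (3 × 4 × 3 × 3 × 3 × 3 = 972 configurations). A minimal sketch of such an exhaustive grid search is shown below; `run_trial` is a hypothetical stand-in for actually training PAE with a configuration and scoring it, not part of the paper's released code.

```python
from itertools import product

# Hyperparameter grid as reported in the paper (Table 5 lists the selected values).
grid = {
    "batch_size": [8, 32, 150],
    "unroll_length": [20, 40, 100, 200],
    "actor_entropy_cost": [0.0001, 0.0005, 0.001],
    "actor_lr": [0.0001, 0.0005, 0.001],
    "planner_lr": [0.0001, 0.0005, 0.001],
    "planner_entropy_cost": [0.001, 0.005, 0.01],
}

def run_trial(config):
    """Hypothetical stand-in: train PAE with `config` and return a score
    (e.g. mean episodic return). Replace with a real training run."""
    return 0.0

def grid_search(grid, evaluate):
    """Exhaustively evaluate every configuration and keep the best one."""
    keys = list(grid)
    best_score, best_config = float("-inf"), None
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

n_configs = len(list(product(*grid.values())))  # 3*4*3*3*3*3 = 972
best_config, best_score = grid_search(grid, run_trial)
```

In practice each of the 972 configurations would be run under several seeds (the paper uses five), so a real sweep would parallelize `run_trial` across workers rather than loop sequentially.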