Learning Task-Distribution Reward Shaping with Meta-Learning

Authors: Haosheng Zou, Tongzheng Ren, Dong Yan, Hang Su, Jun Zhu

AAAI 2021, pp. 11210-11218

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 experiments. This section seeks to answer two key questions: 1) what exactly is being learned and 2) how our framework compares to relevant baselines. We experimented on Cart Poles, grid mazes and Coin Run games (of increasing complexity) through comparison with the only previous baselines (Konidaris and Barto 2006; Snel and Whiteson 2014) and other strong competitors (incl. RAINBOW (Hessel et al. 2018), hand-designed shaping, MQL (Fakoor et al. 2020) and intrinsic rewards (Zheng et al. 2020)).
Researcher Affiliation | Academia | Haosheng Zou¹, Tongzheng Ren², Dong Yan¹, Hang Su¹, Jun Zhu¹*; ¹Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Lab, Bosch-Tsinghua Joint ML Center, Tsinghua University; ²Department of Computer Science, UT Austin
Pseudocode | Yes | Algorithm 1 (Meta-learning potential function prior) and Algorithm 2 (Meta-testing: adaptation with advantage head); a shaping sketch is given after this table.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code or a link to a code repository for the methodology described.
Open Datasets | Yes | The task distribution on Cart Poles is defined by varying the pole length in the range [0.25, 5.00]. The task distribution on grid mazes is defined over all possible maps of size 8 × 8, which yields exponentially many configurations of start, goal and obstacles. The task distribution on Coin Run games (Cobbe et al. 2019) is defined over all possible level configurations. (See the task-sampling sketch after this table.)
Dataset Splits | No | The paper mentions meta-training and meta-testing sets, but does not explicitly describe a separate validation split for hyperparameter tuning.
Hardware Specification | No | The paper discusses computational speed but does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for experiments.
Software Dependencies | No | The paper mentions various algorithms and models (e.g., Dueling DQN, MD3QN) but does not list specific software dependencies with version numbers, such as the Python version or library versions (e.g., TensorFlow, PyTorch).
Experiment Setup | Yes | Cart Poles: we meta-train on 500 sampled tasks for 200 meta iterations with 10 tasks per iteration and meta-test on 40 unseen tasks; the mD3QN uses an MLP with two hidden layers of size 32 before the value heads (see the network sketch after this table). Grid mazes: we meta-train on 1000 sampled maps for 1000 meta iterations and meta-test on 40 unseen maps; the mD3QN uses a CNN similar to that of (Mnih et al. 2015) before the value heads. Coin Run: we meta-train on 2000 generated levels for 1500 meta iterations.
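
For the Pseudocode row: the two algorithms center on meta-learning a potential function prior for reward shaping. Below is a minimal, hypothetical sketch of potential-based shaping with a small learned potential network; the class name, layer sizes, and helper are illustrative assumptions, not the authors' code (none is released).

```python
# Minimal sketch (assumption, not the authors' code) of potential-based reward
# shaping with a learnable potential Phi, the quantity Algorithm 1 meta-learns
# a prior for.
import torch
import torch.nn as nn

class Potential(nn.Module):
    """Small MLP producing a scalar potential Phi(s)."""
    def __init__(self, state_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def shaped_reward(phi, s, s_next, r, gamma=0.99):
    """Classic potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    with torch.no_grad():
        return r + gamma * phi(s_next) - phi(s)
```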
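For the Open Datasets row: the Cart Pole task distribution is fully specified by a pole length drawn from [0.25, 5.00]. A hedged sketch of how such tasks could be sampled with the classic Gym CartPole environment follows; the attribute names (`length`, `masspole`, `polemass_length`) follow Gym's implementation, and the sampler itself is an assumption, not the paper's code.

```python
# Illustrative task sampler (an assumption, not the authors' code) for the
# Cart Pole task distribution: each task is a CartPole instance whose pole
# length is drawn uniformly from [0.25, 5.00].
import numpy as np
import gym

def sample_cartpole_task(rng, low=0.25, high=5.00):
    env = gym.make("CartPole-v1")
    length = rng.uniform(low, high)
    # Gym's CartPoleEnv keeps the (half) pole length in `length` and caches
    # `polemass_length = masspole * length` for its dynamics, so update both.
    env.unwrapped.length = length
    env.unwrapped.polemass_length = env.unwrapped.masspole * length
    return env

rng = np.random.default_rng(0)
meta_train_tasks = [sample_cartpole_task(rng) for _ in range(500)]  # 500 sampled tasks
meta_test_tasks = [sample_cartpole_task(rng) for _ in range(40)]    # 40 unseen tasks
```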
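For the Experiment Setup row: the Cart Pole network is described as an MLP with two hidden layers of size 32 before the value heads. Below is a rough sketch of that layout using a standard dueling value/advantage aggregation; the head sizes and the dueling aggregation are assumptions beyond what the paper states.

```python
# Rough sketch (assumptions noted above) of an MLP trunk with two hidden
# layers of size 32 feeding dueling value/advantage heads.
import torch
import torch.nn as nn

class DuelingMLP(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)               # V(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value_head(h), self.advantage_head(h)
        # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=-1, keepdim=True)
```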