Mega-Reward: Achieving Human-Level Play without Extrinsic Rewards

Authors: Yuhang Song, Jianyi Wang, Thomas Lukasiewicz, Zhenghua Xu, Shangtong Zhang, Andrzej Wojcicki, Mai Xu (pp. 5826-5833)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental studies show that mega-reward (i) can greatly outperform all state-of-the-art intrinsic reward approaches, (ii) generally achieves the same level of performance as Ex-PPO and professional human-level scores, and (iii) has also a superior performance when it is incorporated with extrinsic rewards.
Researcher Affiliation | Collaboration | Yuhang Song (1), Jianyi Wang (3), Thomas Lukasiewicz (1), Zhenghua Xu (1,2), Shangtong Zhang (1), Andrzej Wojcicki (4), Mai Xu (3); (1) Department of Computer Science, University of Oxford, United Kingdom; (2) State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology, China; (3) School of Electronic and Information Engineering, Beihang University, China; (4) Lighthouse
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Easy-to-run code is released at https://github.com/YuhangSong/Mega-Reward.
Open Datasets | Yes | Extensive experimental studies have been conducted on 18 Atari games and the noisy TV domain (Burda et al. 2018); ... As the performance of professional human players (i.e., professional human-player scores) on 16 out of 18 Atari games have already been measured by (Mnih et al. 2015). (A hedged environment-setup sketch follows the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions using Atari games and evaluating over the last 50 episodes, but no explicit train/validation/test splits are detailed.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions implementation details like PPO and environment-wrapping libraries but does not provide specific ancillary software details with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | Mega-reward is implemented on PPO in (Schulman et al. 2017) with the same set of hyper-parameters, along with H × W = 4 × 4 and ρ = 0.99. ... Here, all agents are run for 80M steps, with the last 50 episodes averaged as the final scores and reported in Table 1. (A hedged sketch of this evaluation protocol follows the table.)
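
The Open Datasets row names 18 Atari games and the noisy-TV domain but not the exact environment configuration. Below is a minimal sketch of how such an Atari environment might be prepared with standard Gym wrappers; the game id, wrapper choices, and parameter values are assumptions for illustration, not the authors' reported setup.

```python
# Hypothetical Atari environment setup (assumed, not taken from the paper).
# Requires gym with the Atari extras installed, e.g. `pip install "gym[atari]"`.
import gym
from gym.wrappers import AtariPreprocessing, FrameStack

def make_atari_env(game_id="BreakoutNoFrameskip-v4"):
    """Build one preprocessed Atari environment: 84x84 grayscale frames, 4-frame stack."""
    env = gym.make(game_id)                      # NoFrameskip variant so the wrapper controls frame skipping
    env = AtariPreprocessing(env, frame_skip=4,  # standard DeepMind-style preprocessing
                             screen_size=84, grayscale_obs=True)
    env = FrameStack(env, num_stack=4)           # stack the last 4 frames as the observation
    return env
```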
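
The Experiment Setup row states that agents run for 80M steps and that the last 50 episodes are averaged as the final score. The sketch below illustrates only that averaging rule under the classic Gym step API; `agent.act` is a hypothetical policy interface standing in for a PPO agent, and the mega-reward intrinsic-reward computation itself is not shown.

```python
# Minimal sketch of the reported evaluation rule (assumptions noted in comments).
from collections import deque

def run_and_score(env, agent, total_steps=80_000_000, last_n=50):
    """Run the agent for `total_steps` env steps; return the mean of the last `last_n` episode returns."""
    episode_returns = deque(maxlen=last_n)   # keeps only the most recent `last_n` completed episodes
    obs = env.reset()                        # classic Gym API assumed: reset() -> obs
    ep_return, steps = 0.0, 0
    while steps < total_steps:
        action = agent.act(obs)              # hypothetical policy call (any PPO implementation)
        obs, reward, done, info = env.step(action)
        ep_return += reward
        steps += 1
        if done:
            episode_returns.append(ep_return)
            obs, ep_return = env.reset(), 0.0
    return sum(episode_returns) / max(len(episode_returns), 1)
```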