Reinforcement Learning with Dynamic Boltzmann Softmax Updates

Authors: Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on Grid World show that the DBS operator enables better estimation of the value function, which rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.
Researcher Affiliation | Collaboration | Ling Pan (1), Qingpeng Cai (2), Qi Meng (3), Wei Chen (3), Longbo Huang (1); (1) IIIS, Tsinghua University; (2) Alibaba Group; (3) Microsoft Research
Pseudocode | Yes | Algorithm 1: DBS Deep Q-Network (see the operator sketch after this table)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that the code is open-source or available.
Open Datasets | Yes | We first evaluate DBS value iteration and DBS Q-learning on a tabular game, the Grid World. We then evaluate the DBS-DQN algorithm on 49 Atari video games from the Arcade Learning Environment [Bellemare et al., 2013], a standard challenging benchmark for deep reinforcement learning algorithms, by comparing it with DQN.
Dataset Splits | No | The paper uses RL environments (Grid World, Atari games) where explicit train/validation/test dataset splits (percentages or sample counts) are not typically defined as in supervised learning tasks. It describes training for 50M steps and evaluating performance through human normalized scores, but not specific dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | For fair comparison, we use the same setup of network architectures and hyper-parameters as in [Mnih et al., 2015] for both DQN and DBS-DQN.
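The paper's pseudocode (Algorithm 1) is not reproduced on this page. Below is a minimal Python sketch of the core idea it builds on: the Boltzmann softmax operator with a time-varying inverse temperature beta_t replaces the max in the bootstrapped target. The function names (boltzmann_softmax, dbs_target, beta_schedule) and the quadratic schedule are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def boltzmann_softmax(q_values, beta):
    """Boltzmann softmax operator: a weighted average of q_values with
    softmax weights at inverse temperature beta. As beta -> infinity it
    approaches max(q_values); at beta = 0 it reduces to the mean."""
    q = np.asarray(q_values, dtype=np.float64)
    z = q - q.max()              # shift for numerical stability; cancels in the ratio
    w = np.exp(beta * z)
    return float(np.sum(w * q) / np.sum(w))

def dbs_target(reward, next_q_values, beta_t, gamma=0.99, done=False):
    """One-step bootstrapped target in which the usual max over next-state
    action values is replaced by the Boltzmann softmax at beta_t."""
    if done:
        return reward
    return reward + gamma * boltzmann_softmax(next_q_values, beta_t)

def beta_schedule(t, c=1.0):
    """Illustrative dynamic schedule (an assumption): any positive,
    increasing, unbounded sequence of beta_t works, e.g. c * t**2."""
    return c * float(t) ** 2

# Example: target for one transition at training step t = 100.
q_next = [1.0, 2.0, 0.5]
print(dbs_target(reward=0.1, next_q_values=q_next, beta_t=beta_schedule(100)))

With a small beta_t the target behaves like an average over next-state action values, and as beta_t grows without bound the operator approaches the max; this dynamic schedule is what lets DBS recover the convergence guarantee that a fixed-temperature softmax operator lacks.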