Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Authors: Zhao Song, Ron Parr, Lawrence Carin

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function therein with the softmax function in the target network. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise, on most of them." (A minimal sketch of this target substitution appears after the table.)
Researcher Affiliation | Academia | Zhao Song, Ronald E. Parr, Lawrence Carin (Duke University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is provided at https://github.com/zhao-song/Softmax-DQN."
Open Datasets | Yes | "We tested on six Atari games: Q*Bert, Ms. Pacman, Crazy Climber, Breakout, Asterix, and Seaquest. Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/. The training contains 200 epochs in total."
Dataset Splits | No | The paper states: "The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015)." It does not explicitly provide train/validation/test split percentages or sample counts, deferring these details to an external paper.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | "Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/." The paper names the software frameworks used (Theano, Lasagne) but does not provide version numbers for them or for any other relevant libraries.
Experiment Setup | Yes | "The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015). The inverse temperature parameter was selected based on a grid search over {1, 5, 10}. The training contains 200 epochs in total. The result statistics are obtained by running with five independent random seeds. The optimization of Eq. (1) is performed via RMSProp (Tieleman & Hinton, 2012), with mini-batches sampled from a replay buffer." (A sketch of this grid-search protocol also appears after the table.)
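
The Research Type row above summarizes the paper's core change: in the DQN/DDQN target network, the max over next-state Q-values is replaced with a softmax-weighted average (the softmax Bellman operator). Below is a minimal NumPy sketch of that substitution, not the authors' Theano+Lasagne implementation; the function names, the default gamma, and the choice beta=5.0 are assumptions made for illustration.

```python
import numpy as np

def dqn_target(reward, q_next, gamma=0.99, done=False):
    """Standard DQN target: bootstrap with the max over next-state Q-values."""
    return reward + (0.0 if done else gamma * np.max(q_next))

def softmax_dqn_target(reward, q_next, beta=5.0, gamma=0.99, done=False):
    """Softmax Bellman target: the max is replaced by a Boltzmann-weighted
    average of the next-state Q-values, with inverse temperature beta."""
    z = beta * (q_next - np.max(q_next))  # shift logits for numerical stability
    w = np.exp(z) / np.sum(np.exp(z))     # softmax weights over actions
    return reward + (0.0 if done else gamma * np.dot(w, q_next))

# Example: the softmax target lower-bounds the max-based target and
# approaches it as beta grows.
q_next = np.array([1.0, 2.0, 0.5])
print(dqn_target(0.0, q_next))                      # 1.98
print(softmax_dqn_target(0.0, q_next, beta=5.0))    # slightly below 1.98
print(softmax_dqn_target(0.0, q_next, beta=100.0))  # ~1.98
```

As beta grows without bound the softmax weights concentrate on the maximizing action, so the standard max-based target is the limiting case of this variant.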
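The Experiment Setup row reports a grid search for the inverse temperature over {1, 5, 10}, with statistics over five seeds. A hedged sketch of that selection protocol follows; `train_and_evaluate` is a hypothetical stand-in for one full 200-epoch training run (RMSProp updates on Eq. (1), mini-batches from a replay buffer), and averaging test scores across seeds is an assumed selection criterion, since the paper defers such details to Mnih et al. (2015).

```python
import numpy as np

BETAS = (1.0, 5.0, 10.0)  # grid from the paper
SEEDS = range(5)          # five independent random seeds

def select_beta(train_and_evaluate):
    """Pick the inverse temperature with the best mean test score.

    `train_and_evaluate(beta, seed)` is assumed to run one full training
    and return a scalar test score.
    """
    mean_scores = {
        beta: np.mean([train_and_evaluate(beta, seed) for seed in SEEDS])
        for beta in BETAS
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```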