Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Authors: Zhao Song, Ron Parr, Lawrence Carin

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function therein with the softmax function in the target network. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise, on most of them." (A minimal sketch of this target substitution appears after the table.)
Researcher Affiliation | Academia | Zhao Song, Ronald E. Parr, Lawrence Carin (Duke University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is provided at https://github.com/zhao-song/Softmax-DQN."
Open Datasets | Yes | "We tested on six Atari games: Q*Bert, Ms. Pacman, Crazy Climber, Breakout, Asterix, and Seaquest. Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/. The training contains 200 epochs in total."
Dataset Splits | No | The paper states: "The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015)." It does not explicitly provide train/validation/test split percentages or sample counts, deferring these details to an external paper.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | "Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/." The paper names the software frameworks used (Theano, Lasagne) but does not provide version numbers for them or for any other relevant libraries.
Experiment Setup | Yes | "The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015). The inverse temperature parameter was selected based on a grid search over {1, 5, 10}. The training contains 200 epochs in total. The result statistics are obtained by running with five independent random seeds. The optimization of Eq. (1) is performed via RMSProp (Tieleman & Hinton, 2012), with mini-batches sampled from a replay buffer." (A sketch of this grid-search protocol also appears after the table.)
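
The Research Type row above summarizes the paper's core change: in the DQN/DDQN target network, the max over next-state Q-values is replaced with a softmax-weighted average (the softmax Bellman operator). Below is a minimal NumPy sketch of that substitution, not the authors' Theano+Lasagne implementation; the function names, the default gamma, and the choice beta=5.0 are assumptions made for illustration.

```python
import numpy as np

def dqn_target(reward, q_next, gamma=0.99, done=False):
    """Standard DQN target: bootstrap with the max over next-state Q-values."""
    return reward + (0.0 if done else gamma * np.max(q_next))

def softmax_dqn_target(reward, q_next, beta=5.0, gamma=0.99, done=False):
    """Softmax Bellman target: the max is replaced by a Boltzmann-weighted
    average of the next-state Q-values, with inverse temperature beta."""
    z = beta * (q_next - np.max(q_next))  # shift logits for numerical stability
    w = np.exp(z) / np.sum(np.exp(z))     # softmax weights over actions
    return reward + (0.0 if done else gamma * np.dot(w, q_next))

# Example: the softmax target lower-bounds the max-based target and
# approaches it as beta grows.
q_next = np.array([1.0, 2.0, 0.5])
print(dqn_target(0.0, q_next))                      # 1.98
print(softmax_dqn_target(0.0, q_next, beta=5.0))    # slightly below 1.98
print(softmax_dqn_target(0.0, q_next, beta=100.0))  # ~1.98
```

As beta grows without bound the softmax weights concentrate on the maximizing action, so the standard max-based target is the limiting case of this variant.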
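The Experiment Setup row reports a grid search for the inverse temperature over {1, 5, 10}, with statistics over five seeds. A hedged sketch of that selection protocol follows; `train_and_evaluate` is a hypothetical stand-in for one full 200-epoch training run (RMSProp updates on Eq. (1), mini-batches from a replay buffer), and averaging test scores across seeds is an assumed selection criterion, since the paper defers such details to Mnih et al. (2015).

```python
import numpy as np

BETAS = (1.0, 5.0, 10.0)  # grid from the paper
SEEDS = range(5)          # five independent random seeds

def select_beta(train_and_evaluate):
    """Pick the inverse temperature with the best mean test score.

    `train_and_evaluate(beta, seed)` is assumed to run one full training
    and return a scalar test score.
    """
    mean_scores = {
        beta: np.mean([train_and_evaluate(beta, seed) for seed in SEEDS])
        for beta in BETAS
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```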