An Alternative Softmax Operator for Reinforcement Learning

Authors: Kavosh Asadi, Michael L. Littman

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We introduce a variant of SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice." and "We present three additional experiments." (A hedged sketch of this policy appears after the table.)
Researcher Affiliation | Academia | "Brown University, USA."
Pseudocode | Yes | "Algorithm 1 SARSA with Boltzmann softmax policy" (see the SARSA sketch after the table)
Open Source Code | No | No statement providing concrete access to the source code for the methodology described in this paper was found.
Open Datasets | Yes | "We used the lunar lander domain, from Open AI Gym (Brockman et al., 2016) as our benchmark." and "We evaluated SARSA on the multi-passenger taxi domain introduced by Dearden et al. (1998)." (see the experiment-setup sketch after the table)
Dataset Splits | No | The paper reports no training, validation, or test splits. Its experiments run in interactive environments (Lunar Lander, multi-passenger taxi) rather than on fixed datasets, so no explicit data partitioning applies.
Hardware Specification | No | No specific hardware details (CPU or GPU models, memory, or cloud instance types) used to run the experiments are mentioned in the paper.
Software Dependencies | Yes | "We used Keras (Chollet, 2015) and Theano (Team et al., 2016) to implement the neural network architecture." (The cited years are publication dates, not the software versions used for the experiments.)
Experiment Setup | Yes | "We used the Adam algorithm (Kingma & Ba, 2014) with α = 0.005 and the other parameters as suggested by the paper. A batch episode size of 10 was used, as we had stability issues with smaller episode batch sizes." (see the experiment-setup sketch after the table)
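
For readers reconstructing the method from the table: the paper's alternative operator (called mellowmax) is mm_ω(q) = log((1/n) Σ_i exp(ω q_i)) / ω, and its SARSA variant acts with a Boltzmann policy whose inverse temperature β is solved per state so that the Boltzmann-weighted mean of the action values equals the mellowmax value. The sketch below is a minimal Python/NumPy/SciPy illustration of those two definitions, not the authors' code; the function names and the root-finding bracket are our assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import brentq

def mellowmax(q, omega=5.0):
    # mm_omega(q) = log( mean( exp(omega * q) ) ) / omega, computed stably.
    q = np.asarray(q, dtype=float)
    return (logsumexp(omega * q) - np.log(len(q))) / omega

def mellowmax_policy(q, omega=5.0):
    """Boltzmann policy with a state-dependent temperature: solve for the
    inverse temperature beta at which the Boltzmann-weighted mean of q
    equals mm_omega(q). The bracket [1e-10, 1e5] is a guess for illustration."""
    q = np.asarray(q, dtype=float)
    adv = q - mellowmax(q, omega)            # advantages w.r.t. the mellowmax value
    if np.allclose(adv, 0.0):                # all actions equal: uniform policy
        return np.full(len(q), 1.0 / len(q))

    def residual(beta):
        w = np.exp(beta * adv - np.max(beta * adv))   # overflow-safe weights
        return np.dot(w / w.sum(), adv)               # zero at the right beta

    beta = brentq(residual, 1e-10, 1e5)      # residual is negative at 0+, positive at infinity
    w = np.exp(beta * adv - np.max(beta * adv))
    return w / w.sum()
```

As a sanity check, for q = [1.0, 2.0, 3.0] the value mellowmax(q) lies strictly between the mean (2.0) and the max (3.0), approaching the mean as ω → 0 and the max as ω → ∞.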
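
The Pseudocode row quotes the caption of the paper's Algorithm 1 (SARSA with Boltzmann softmax policy). A tabular rendering of that loop, reusing mellowmax_policy from the sketch above, might look as follows; the classic Gym step API (obs, reward, done, info), the dict-of-arrays Q table, and the learning-rate and discount values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sarsa_mellowmax_episode(env, Q, omega=5.0, alpha=0.1, gamma=0.99):
    # One episode of SARSA acting via the state-dependent Boltzmann policy.
    # Q: dict-like table mapping a (hashable) state to an array of action values.
    state = env.reset()                                   # classic Gym API assumed
    action = np.random.choice(len(Q[state]),
                              p=mellowmax_policy(Q[state], omega))
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)    # (obs, reward, done, info)
        if done:
            bootstrap = 0.0                               # no value beyond terminal states
            next_action = None
        else:
            next_action = np.random.choice(len(Q[next_state]),
                                           p=mellowmax_policy(Q[next_state], omega))
            bootstrap = gamma * Q[next_state][next_action]
        Q[state][action] += alpha * (reward + bootstrap - Q[state][action])
        state, action = next_state, next_action
    return Q
```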
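
The Open Datasets, Software Dependencies, and Experiment Setup rows together pin down most of the lunar lander configuration. A minimal reconstruction with the era-appropriate stack (classic OpenAI Gym; Keras on the Theano backend) could look like the sketch below. The environment ID, hidden-layer sizes, and activations are assumptions the excerpt does not confirm; the Adam step size α = 0.005, with the remaining parameters at the Kingma & Ba (2014) defaults, is quoted from the paper.

```python
import gym
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# "LunarLander-v2" is an assumption: the paper only says
# "the lunar lander domain, from Open AI Gym (Brockman et al., 2016)".
env = gym.make("LunarLander-v2")

# Hidden-layer sizes and activations are placeholders, not from the paper.
model = Sequential([
    Dense(64, activation="relu", input_shape=env.observation_space.shape),
    Dense(64, activation="relu"),
    Dense(env.action_space.n, activation="linear"),   # one action value per action
])

# Adam with alpha = 0.005 as reported; old Keras spells the learning rate
# "lr", while newer releases use "learning_rate".
model.compile(optimizer=Adam(lr=0.005), loss="mse")
```

In Keras releases of that period, the Theano backend is selected via the KERAS_BACKEND environment variable or the "backend" field in keras.json, which is why no backend choice appears in the code itself.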