Stochastic Q-learning for Large Discrete Action Spaces

Authors: Fares Fourati, Vaneet Aggarwal, Mohamed-Slim Alouini

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
Researcher Affiliation | Academia | 1) Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, KSA. 2) School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA.
Pseudocode | Yes | We introduce Stochastic Q-learning, described in Algorithm 1, and Stochastic Double Q-learning, described in Algorithm 2 in Appendix C, that replace the max and arg max operations in Q-learning and Double Q-learning with stoch max and stoch arg max, respectively. (A sketch of this substitution follows the table.)
Open Source Code | No | The paper does not state that the code for the proposed methods is open source, nor does it provide a link to it. It only mentions using the Stable-Baselines library for the comparison algorithms.
Open Datasets | Yes | We test our proposed algorithms on a standardized set of environments using open-source libraries. We compare stochastic maximization to exact maximization and evaluate the proposed stochastic RL algorithms on Gymnasium environments (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) environments. (An interaction-loop sketch follows the table.)
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits. Experiments are conducted in reinforcement learning environments (Gymnasium, MuJoCo) where data is generated through interaction, rather than using static, pre-split datasets.
Hardware Specification | Yes | We test the training time using a CPU 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz (1.69 GHz), with 16.0 GB RAM.
Software Dependencies | Yes | We implement the different Q-learning methods using Python 3.9, Numpy 1.23.4, and Pytorch 2.0.1.
Experiment Setup | Yes | We set the discount factor γ to 0.95 and apply a dynamical polynomial learning rate α with α_t(s, a) = 1/z_t(s, a)^0.8, where z_t(s, a) is the number of times the pair (s, a) has been visited, initially set to one for all the pairs. For the exploration rate, we use a decaying ε, defined as ε(s) = 1/√(z(s)), where z(s) is the number of times state s has been visited, initially set to one for all the states. (These schedules are sketched after the table.)
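
The pseudocode row describes replacing the exact max and arg max over all n actions with stoch max and stoch arg max, which search only a random subset of roughly log(n) actions merged with a remembered action. Below is a minimal tabular sketch of that idea; the helper names (stoch_arg_max, stochastic_q_update) and the per-state memory mechanism are illustrative assumptions, not the authors' released implementation.

```python
import math
import random
from collections import defaultdict

def stoch_arg_max(q_row, actions, memory_action=None):
    """Approximate arg max: scan a random subset of about log2(n) actions,
    optionally joined with a remembered action (e.g., the previous greedy one)."""
    k = max(1, math.ceil(math.log2(len(actions))))
    candidates = set(random.sample(actions, k))
    if memory_action is not None:
        candidates.add(memory_action)
    return max(candidates, key=lambda a: q_row[a])

def stochastic_q_update(Q, s, a, r, s_next, actions, memory, alpha, gamma):
    """One tabular update where stoch max replaces the exact max over all actions."""
    best_next = stoch_arg_max(Q[s_next], actions, memory.get(s_next))
    memory[s_next] = best_next                    # remember the returned action for this state
    td_target = r + gamma * Q[s_next][best_next]  # stoch max estimate of the next-state value
    Q[s][a] += alpha * (td_target - Q[s][a])

# Example setup: a large discrete action set, a nested Q-table, and the action memory.
actions = list(range(1000))
Q = defaultdict(lambda: defaultdict(float))
memory = {}
```

The key point is that each update touches only about log(n) + 1 Q-values instead of all n, which is where the reported speed-up over exact maximization comes from.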
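For the open-datasets row, experience is generated by interacting with open-source Gymnasium and MuJoCo environments rather than read from a static dataset. A minimal interaction-loop sketch using the Gymnasium API is shown below; the environment id and the random action choice are placeholders standing in for the tasks and learned policies used in the paper.

```python
import gymnasium as gym

# Placeholder environment id; the paper evaluates on Gymnasium and MuJoCo tasks.
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```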
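The learning-rate and exploration schedules quoted in the experiment-setup row translate directly into code. Below is a short sketch assuming dictionary-backed visit counters initialized to one; incrementing the counters after each visit is left out.

```python
import math
from collections import defaultdict

GAMMA = 0.95  # discount factor, as reported in the paper

# Visit counters, initialized to one for all pairs/states as stated in the setup.
state_action_visits = defaultdict(lambda: 1)  # z_t(s, a)
state_visits = defaultdict(lambda: 1)         # z(s)

def learning_rate(s, a):
    """Polynomial learning rate: alpha_t(s, a) = 1 / z_t(s, a) ** 0.8."""
    return 1.0 / (state_action_visits[(s, a)] ** 0.8)

def exploration_rate(s):
    """Decaying exploration rate: epsilon(s) = 1 / sqrt(z(s))."""
    return 1.0 / math.sqrt(state_visits[s])
```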