Stochastic Q-learning for Large Discrete Action Spaces

Authors: Fares Fourati, Vaneet Aggarwal, Mohamed-Slim Alouini

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
Researcher Affiliation | Academia | 1) Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, KSA. 2) School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA.
Pseudocode | Yes | We introduce Stochastic Q-learning, described in Algorithm 1, and Stochastic Double Q-learning, described in Algorithm 2 in Appendix C, that replace the max and arg max operations in Q-learning and Double Q-learning with stoch max and stoch arg max, respectively. (A sketch of this substitution follows the table.)
Open Source Code | No | The paper does not state that the code for the proposed methods is open source, nor does it provide a link to it. It only mentions using the Stable-Baselines library for the comparison algorithms.
Open Datasets | Yes | We test our proposed algorithms on a standardized set of environments using open-source libraries. We compare stochastic maximization to exact maximization and evaluate the proposed stochastic RL algorithms on Gymnasium environments (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) environments. (An interaction-loop sketch follows the table.)
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits. Experiments are conducted in reinforcement learning environments (Gymnasium, MuJoCo) where data is generated through interaction, rather than using static, pre-split datasets.
Hardware Specification | Yes | We test the training time using a CPU 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz (1.69 GHz), with 16.0 GB RAM.
Software Dependencies | Yes | We implement the different Q-learning methods using Python 3.9, Numpy 1.23.4, and Pytorch 2.0.1.
Experiment Setup | Yes | We set the discount factor γ to 0.95 and apply a dynamical polynomial learning rate α with α_t(s, a) = 1/z_t(s, a)^0.8, where z_t(s, a) is the number of times the pair (s, a) has been visited, initially set to one for all the pairs. For the exploration rate, we use a decaying ε, defined as ε(s) = 1/√(z(s)), where z(s) is the number of times state s has been visited, initially set to one for all the states. (These schedules are sketched after the table.)
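
The pseudocode row describes replacing the exact max and arg max over all n actions with stoch max and stoch arg max, which search only a random subset of roughly log(n) actions merged with a remembered action. Below is a minimal tabular sketch of that idea; the helper names (stoch_arg_max, stochastic_q_update) and the per-state memory mechanism are illustrative assumptions, not the authors' released implementation.

```python
import math
import random
from collections import defaultdict

def stoch_arg_max(q_row, actions, memory_action=None):
    """Approximate arg max: scan a random subset of about log2(n) actions,
    optionally joined with a remembered action (e.g., the previous greedy one)."""
    k = max(1, math.ceil(math.log2(len(actions))))
    candidates = set(random.sample(actions, k))
    if memory_action is not None:
        candidates.add(memory_action)
    return max(candidates, key=lambda a: q_row[a])

def stochastic_q_update(Q, s, a, r, s_next, actions, memory, alpha, gamma):
    """One tabular update where stoch max replaces the exact max over all actions."""
    best_next = stoch_arg_max(Q[s_next], actions, memory.get(s_next))
    memory[s_next] = best_next                    # remember the returned action for this state
    td_target = r + gamma * Q[s_next][best_next]  # stoch max estimate of the next-state value
    Q[s][a] += alpha * (td_target - Q[s][a])

# Example setup: a large discrete action set, a nested Q-table, and the action memory.
actions = list(range(1000))
Q = defaultdict(lambda: defaultdict(float))
memory = {}
```

The key point is that each update touches only about log(n) + 1 Q-values instead of all n, which is where the reported speed-up over exact maximization comes from.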
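For the open-datasets row, experience is generated by interacting with open-source Gymnasium and MuJoCo environments rather than read from a static dataset. A minimal interaction-loop sketch using the Gymnasium API is shown below; the environment id and the random action choice are placeholders standing in for the tasks and learned policies used in the paper.

```python
import gymnasium as gym

# Placeholder environment id; the paper evaluates on Gymnasium and MuJoCo tasks.
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```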
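The learning-rate and exploration schedules quoted in the experiment-setup row translate directly into code. Below is a short sketch assuming dictionary-backed visit counters initialized to one; incrementing the counters after each visit is left out.

```python
import math
from collections import defaultdict

GAMMA = 0.95  # discount factor, as reported in the paper

# Visit counters, initialized to one for all pairs/states as stated in the setup.
state_action_visits = defaultdict(lambda: 1)  # z_t(s, a)
state_visits = defaultdict(lambda: 1)         # z(s)

def learning_rate(s, a):
    """Polynomial learning rate: alpha_t(s, a) = 1 / z_t(s, a) ** 0.8."""
    return 1.0 / (state_action_visits[(s, a)] ** 0.8)

def exploration_rate(s):
    """Decaying exploration rate: epsilon(s) = 1 / sqrt(z(s))."""
    return 1.0 / math.sqrt(state_visits[s])
```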