DeepTOP: Deep Threshold-Optimal Policy for MDPs and RMABs
Authors: Khaled Nakhleh, I-Hong Hou
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation results show that our policy significantly outperforms other reinforcement learning algorithms due to its ability to exploit the monotone property. In addition, we show that the Whittle index, a powerful tool for restless multi-armed bandit problems, is equivalent to the optimal threshold policy for an alternative problem. This observation leads to a simple algorithm that finds the Whittle index by learning the optimal threshold policy in the alternative problem. Simulation results show that our algorithm learns the Whittle index much faster than several recent studies that learn the Whittle index through indirect means. |
| Researcher Affiliation | Academia | Khaled Nakhleh, I-Hong Hou, Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, {khaled.jamal, ihou}@tamu.edu |
| Pseudocode | Yes | Algorithm 1 Deep Threshold Optimal Policy Training for MDPs (Deep TOP-MDP) (an illustrative sketch of the threshold action rule appears after this table) |
| Open Source Code | Yes | All source code can be found in the repository https://github.com/khalednakhleh/deeptop. |
| Open Datasets | No | The paper constructs and extends several simulated control problems (e.g., EV charging, inventory management, one-dimensional bandits). It states that these problems are 'based on' or 'extended from' previous work, but it does not provide concrete access information (a link, DOI, specific repository, or formal citation for a public dataset) for the specific simulation parameters or dataset instances used for training. |
| Dataset Splits | No | The paper describes filling an agent's memory with transitions and then evaluating performance over timesteps in simulated environments, which is typical for reinforcement learning. However, it does not explicitly provide information on dataset splits (e.g., percentages or sample counts) for traditional training, validation, and test sets, as its experimental setup is based on continuous interaction with simulated environments rather than static datasets. |
| Hardware Specification | No | The paper states that hardware details are in Appendix D ('Did you include the total amount of compute and the type of resources used...? [Yes] see Appendix D.'). However, Appendix D is not included in the provided text, so specific hardware details cannot be found. |
| Software Dependencies | No | The paper mentions that training parameters and hyper-parameters can be found in Appendix D ('Did you specify all the training details...? [Yes] see Appendix D.'). However, Appendix D is not included in the provided text, so specific software dependencies with version numbers cannot be found. |
| Experiment Setup | No | The paper states that details about the training parameters (which typically include hyperparameters) can be found in Appendix D ('Details about the training parameters can be found in Appendix D.'). However, Appendix D is not included in the provided text, so specific experimental setup details are not present in the main text. |
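
For context on the pseudocode row, the following is a minimal, hypothetical sketch of the threshold structure that Algorithm 1 (Deep TOP-MDP) trains: an actor network maps the non-scalar part of the state to a scalar threshold, and the binary action is taken exactly when the scalar state component reaches that threshold, which is the monotone property the paper exploits. The class name `ThresholdActor`, the `context_dim` argument, and the network sizes here are illustrative assumptions, not the authors' implementation; the actual training code is in the linked repository.

```python
import torch
import torch.nn as nn


class ThresholdActor(nn.Module):
    """Hypothetical threshold actor (illustrative only, not the DeepTOP code).

    Maps a context vector (the non-scalar part of the state) to a scalar
    threshold; the binary action is 1 exactly when the scalar state
    component is at or above that threshold.
    """

    def __init__(self, context_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def threshold(self, context: torch.Tensor) -> torch.Tensor:
        # One scalar threshold per context vector in the batch.
        return self.net(context).squeeze(-1)

    def act(self, scalar_state: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Monotone (threshold) structure: activate iff the scalar state
        # reaches the learned threshold for this context.
        return (scalar_state >= self.threshold(context)).long()


# Example: a batch of 4 states, each with a scalar component and a
# 3-dimensional context vector.
actor = ThresholdActor(context_dim=3)
scalar = torch.tensor([0.2, 1.5, -0.3, 0.9])
context = torch.randn(4, 3)
print(actor.act(scalar, context))  # tensor of 0/1 actions
```

The sketch covers only the action rule; how the threshold network is updated from simulated transitions follows Algorithm 1 in the paper and the code in the repository above.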