Recursive Reinforcement Learning
Authors: Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): "We implemented Algorithm 1 in tabular form as well as with a neural network. For the tabular implementation, we quantized the vector v directly after its computation to ensure that the Q-table remains small and discrete. For the neural network implementation we used the techniques used in DQN [30], replay buffers and target networks, for additional stability. The details of this implementation can be found in the appendix. We consider three examples: one to demonstrate the application of Recursive Q-learning for synthesizing probabilistic programs, one to demonstrate convergence in the single-exit setting, and one to demonstrate the use of a context-free reward machine. We compare Recursive Q-learning to Q-learning where the RMDP is treated as an MDP by the agent, i.e., the agent treats stack calls and returns as if they were normal MDP transitions." (Figure 3 caption: "Learning curves. Ten runs were performed for each method. The mean is shown as a dark line. The region between the 10th and 90th percentile is shaded.") Hedged sketches of the two implementation styles appear after this table. |
| Researcher Affiliation | Academia | Ernst Moritz Hahn (University of Twente); Mateo Perez (University of Colorado Boulder); Sven Schewe (University of Liverpool); Fabio Somenzi (University of Colorado Boulder); Ashutosh Trivedi (University of Colorado Boulder); Dominik Wojtczak (University of Liverpool) |
| Pseudocode | Yes | Algorithm 1: Recursive Q-learning and Algorithm 2: Recursive Q-learning (1-exit special case) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material. |
| Open Datasets | No | The paper describes custom experimental scenarios/environments (e.g., 'Cloud computing example', 'single-exit RMDP gridworld', 'Palindrome gridworld') but does not provide specific access information (like URLs, DOIs, or formal citations to existing public datasets) for the data used in these experiments. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, sample counts, or explicit methodology for splitting) for training, validation, or testing. |
| Hardware Specification | Yes | This work utilized the Summit supercomputer, which is supported by the National Science Foundation (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University. |
| Software Dependencies | No | The paper mentions techniques like 'DQN', 'replay buffers', and 'target networks', but does not provide specific software dependencies (e.g., library names with version numbers like PyTorch 1.9 or Python 3.8). |
| Experiment Setup | No | The paper states, 'The details of this implementation can be found in the appendix,' indicating that specific experimental setup details, such as hyperparameters, are deferred to supplementary materials and are not present in the main text. |
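
The excerpt quoted under Research Type describes a tabular implementation in which the exit-value vector v is quantized immediately after computation so the Q-table stays small and discrete. The following is a minimal, hypothetical sketch of that idea, not the authors' released code: the `env` interface (`reset`, `step`, `actions`), the treatment of v as part of the observation, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict

def quantize(v, step=0.1):
    """Round each entry of the exit-value vector v to a fixed grid so the
    augmented Q-table remains small and discrete (hypothetical grid size)."""
    return tuple(np.round(np.asarray(v) / step) * step)

def recursive_q_learning_sketch(env, episodes=1000, alpha=0.1, gamma=1.0,
                                epsilon=0.1):
    """Tabular Q-learning over states augmented with a quantized exit-value
    vector; a simplified illustration, not the paper's Algorithm 1."""
    Q = defaultdict(float)                    # Q[(state, v_key, action)] -> value
    for _ in range(episodes):
        s, v = env.reset()                    # assumed: state plus exit-value vector
        v_key = quantize(v)
        done = False
        while not done:
            actions = env.actions(s)
            if np.random.rand() < epsilon:    # epsilon-greedy exploration
                a = np.random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, v_key, a_)])
            (s2, v2), r, done = env.step(a)
            v2_key = quantize(v2)             # quantize directly after computation
            target = r if done else r + gamma * max(
                Q[(s2, v2_key, a_)] for a_ in env.actions(s2))
            Q[(s, v_key, a)] += alpha * (target - Q[(s, v_key, a)])
            s, v_key = s2, v2_key
    return Q
```

Quantizing v trades some precision in the exit-value estimates for a finite, enumerable key space, which is what keeps the tabular method feasible.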
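The same excerpt says the neural-network implementation borrowed the stabilizers used in DQN [30], namely a replay buffer and a target network. Below is a generic, hedged PyTorch sketch of that update step; the network architecture, hyperparameters, and tensor layout are assumptions, and how the exit-value vector v enters the observation is deferred to the paper's appendix.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small illustrative Q-network; layer sizes are assumptions."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_update(online, target, buffer, optimizer, batch_size=64, gamma=0.99):
    """One gradient step on a minibatch from the replay buffer, bootstrapping
    from the frozen target network for stability (generic DQN-style update)."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    # Each buffer entry is assumed to be (state, action, reward, next_state,
    # done) with float states/rewards/dones and int64 actions.
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    with torch.no_grad():                     # target net gives stable bootstraps
        y = r + gamma * (1 - done) * target(s2).max(dim=1).values
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Typical usage (illustrative): buffer = deque(maxlen=100_000); after every
# few hundred updates, sync the target network with
# target.load_state_dict(online.state_dict()).
```

The periodic target-network sync and the decorrelated minibatches from the replay buffer are the two DQN stabilizers the excerpt names; any coupling with the recursive (stack-augmented) state is specific to the paper's appendix and is not reproduced here.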