Recursive Reinforcement Learning

Authors: Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak

NeurIPS 2022

Reproducibility Assessment (variable, result, and supporting LLM response)
Research Type: Experimental (5 experiments)
LLM response: "We implemented Algorithm 1 in tabular form as well as with a neural network. For the tabular implementation, we quantized the vector v directly after its computation to ensure that the Q-table remains small and discrete. For the neural network implementation we used the techniques used in DQN [30], replay buffers and target networks, for additional stability. The details of this implementation can be found in the appendix. We consider three examples: one to demonstrate the application of Recursive Q-learning for synthesizing probabilistic programs, one to demonstrate convergence in the single-exit setting, and one to demonstrate the use of a context-free reward machine. We compare Recursive Q-learning to Q-learning where the RMDP is treated as an MDP by the agent, i.e., the agent treats stack calls and returns as if they were normal MDP transitions."
Figure 3 (caption): "Learning curves. Ten runs were performed for each method. The mean is shown as a dark line. The region between the 10th and 90th percentile is shaded."
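To ground the quoted description, here is a minimal sketch of what a tabular Recursive Q-learning episode loop with quantized exit-value vectors could look like. This is an illustration, not the authors' code: the environment interface (`reset`, `actions`, and `step` returning 'call'/'return'/'done' events with `info['exits']` and `info['exit']`), the hyperparameters, and the undiscounted-total-reward setting are all assumptions made for the sketch.

```python
from collections import defaultdict
import random

import numpy as np

ALPHA, EPS, QUANT = 0.1, 0.1, 0.5     # hypothetical hyperparameters
Q = defaultdict(float)                # Q[(state, v, action)] -> estimate


def quantize(v):
    """Snap the exit-value vector onto a fixed grid so that only
    finitely many augmented states (state, v) can ever appear."""
    return tuple(np.round(np.asarray(v, dtype=float) / QUANT) * QUANT)


def run_episode(env):
    """One episode of sketched Recursive Q-learning (undiscounted).

    Assumed interface: env.reset() -> state; env.actions(s) -> list;
    env.step(a) -> (event, next_state, reward, info), where event is
    'step' | 'call' | 'return' | 'done'.  On 'call', info['exits']
    lists the caller-side states reached from each callee exit; on
    'return', info['exit'] is the index of the exit actually taken.
    """
    def best(s, w):
        return max(Q[(s, w, b)] for b in env.actions(s))

    stack = []                 # saved caller contexts (one v per frame)
    v = ()                     # top-level component has no pending exits
    state = env.reset()
    while True:
        acts = env.actions(state)
        if random.random() < EPS:
            a = random.choice(acts)           # epsilon-greedy exploration
        else:
            a = max(acts, key=lambda b: Q[(state, v, b)])
        event, nxt, r, info = env.step(a)

        if event == "call":
            # Entering a box: price each callee exit by the best caller
            # continuation under the current context v, then quantize
            # immediately (the step the quoted text credits with keeping
            # the Q-table small and discrete), and push the old context.
            nxt_v = quantize([best(e, v) for e in info["exits"]])
            stack.append(v)
            target = best(nxt, nxt_v)
        elif event == "return":
            # Leaving through exit i: the value promised to that exit on
            # entry, v[i], is the bootstrap target; restore the caller's v.
            target = v[info["exit"]]
            nxt_v = stack.pop()
        elif event == "done":
            target, nxt_v = 0.0, v
        else:                  # ordinary transition inside the component
            target, nxt_v = best(nxt, v), v

        Q[(state, v, a)] += ALPHA * (r + target - Q[(state, v, a)])
        if event == "done":
            return
        state, v = nxt, nxt_v
```

The 'call' branch is where the quoted detail lives: v is quantized the moment it is computed, so the table is indexed by a finite grid of exit-value vectors rather than by arbitrary real-valued vectors.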
Researcher Affiliation: Academia
LLM response: Ernst Moritz Hahn (University of Twente); Mateo Perez (University of Colorado Boulder); Sven Schewe (University of Liverpool); Fabio Somenzi (University of Colorado Boulder); Ashutosh Trivedi (University of Colorado Boulder); Dominik Wojtczak (University of Liverpool)
Pseudocode: Yes
LLM response: Algorithm 1 (Recursive Q-learning) and Algorithm 2 (Recursive Q-learning, 1-exit special case) are given in the paper.
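As a companion to the entry above, the following is one plausible shape for the 1-exit specialization: with a single exit per component and undiscounted total reward, the value of a call can decompose additively into the reward earned inside the callee and the value of the caller's continuation, so no exit-value vector needs to be carried. This is a hedged reconstruction, not the paper's Algorithm 2 verbatim; `resume_state` and the rest of the interface are assumptions.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> reward-to-exit estimate


def update_1exit(Q, env, state, action, alpha=0.1):
    """One TD update under the assumed 1-exit decomposition: Q values
    inside a component estimate the (undiscounted) reward accumulated
    until that component's single exit, nothing more."""
    event, nxt, r, info = env.step(action)

    def best(s):
        return max(Q[(s, b)] for b in env.actions(s))

    if event == "call":
        # A call is worth what the callee earns before its single exit
        # plus what the caller earns after resuming (additive split);
        # `resume_state` is assumed environment bookkeeping.
        target = best(nxt) + best(info["resume_state"])
    elif event in ("return", "done"):
        # At the exit itself there is nothing left to earn locally.
        target = 0.0
    else:
        target = best(nxt)
    Q[(state, action)] += alpha * (r + target - Q[(state, action)])
    return event, nxt
```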
Open Source Code: Yes
LLM response: "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material."
Open Datasets: No
LLM response: The paper describes custom experimental scenarios/environments (e.g., 'Cloud computing example', 'single-exit RMDP gridworld', 'Palindrome gridworld') but does not provide access information (such as URLs, DOIs, or formal citations to existing public datasets) for the data used in these experiments.
Dataset Splits: No
LLM response: The paper does not provide dataset split information (e.g., percentages, sample counts, or an explicit splitting methodology) for training, validation, or testing.
Hardware Specification: Yes
LLM response: "This work utilized the Summit supercomputer, which is supported by the National Science Foundation (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University."
Software Dependencies: No
LLM response: The paper mentions techniques such as DQN, replay buffers, and target networks, but does not name specific software dependencies with version numbers (e.g., PyTorch 1.9 or Python 3.8).
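The row above flags missing version pins rather than missing concepts, so a framework-free sketch of the two named stabilizers may still be useful. Everything below (buffer capacity, batch size, update period, dict-like parameter containers) is a generic assumption, not a description of the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of transitions; sampling uniformly at
    random breaks the temporal correlation of consecutive updates."""

    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        # transition = (state, action, reward, next_state, done)
        self.buf.append(transition)

    def sample(self, batch_size=64):
        return random.sample(self.buf, batch_size)


def sync_target(online_params, target_params, period, step):
    """Hard target-network update: copy the online parameters into the
    target network every `period` learner steps, so the TD bootstrap
    targets change slowly and training stays stable."""
    if step % period == 0:
        target_params.update(online_params)   # assumes dict-like params
    return target_params
```

In a DQN-style learner, `sample` batches feed the TD updates while `sync_target` keeps the bootstrap targets quasi-stationary between synchronizations.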
Experiment Setup: No
LLM response: The paper states, "The details of this implementation can be found in the appendix," indicating that specifics of the experimental setup, such as hyperparameters, are deferred to the supplementary material rather than given in the main text.