A Boosting Approach to Reinforcement Learning
Authors: Nataly Brukhim, Elad Hazan, Karan Singh
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our results, we check whether the proposed algorithm is indeed capable of boosting the accuracy of concrete instantiations of weak learners. We evaluated these on the Cart Pole and the Lunar Lander environments. The results demonstrate that the proposed RL boosting algorithm succeeds in maximizing rewards while using only a few weak learners (equivalently, within a few rounds of boosting). |
| Researcher Affiliation | Collaboration | Nataly Brukhim (Princeton University, nbrukhim@cs.princeton.edu); Elad Hazan (Princeton University and Google AI Princeton, ehazan@cs.princeton.edu); Karan Singh (Carnegie Mellon University, karansingh@cmu.edu) |
| Pseudocode | Yes | Algorithm 1 RL Boosting, Algorithm 2 Internal Boost, and Algorithm 3 Trajectory Sampler are presented in the paper. |
| Open Source Code | Yes | The paper's reproducibility checklist answers 'Yes' to: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?' |
| Open Datasets | No | We evaluated these on the Cart Pole and the Lunar Lander environments. The paper refers to environments that generate data, not fixed datasets with access information or citations. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It only mentions that 'reward is computed over 100 episodes of interactions', which describes evaluation rather than data partitioning for model training. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, nor does it mention specific GPU/CPU models or cloud resources. |
| Software Dependencies | No | The paper mentions 'Scikit-Learn [30]' but does not provide a specific version number for this or any other software dependency. |
| Experiment Setup | Yes | Throughout all the experiments, we used γ = 0.9. To speed up computation, the plots in the paper were generated by retaining only the 3 most recent policies of every iteration in the policy mixture. Illustrative sketches of such a boosting loop and its evaluation protocol appear below the table. |
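
To make the setup above concrete, the following is a minimal sketch of a boosting-style RL loop that uses a scikit-learn regressor as the weak learner on CartPole. It is not the paper's Algorithm 1 (RL Boosting) or Algorithm 3 (Trajectory Sampler); the round count, the `rollout` helper, the per-action regression-tree weak learner, and the mixture-truncation constant are all illustrative assumptions, and the gymnasium API is assumed for the environment.

```python
# Illustrative skeleton only: each boosting round fits a shallow scikit-learn
# regressor (the "weak learner") to Monte Carlo returns and adds the resulting
# greedy policy to a mixture. Hypothetical names: rollout, greedy_policy,
# N_ROUNDS, EPISODES_PER_ROUND, KEEP_LAST.
import gymnasium as gym
import numpy as np
from sklearn.tree import DecisionTreeRegressor

GAMMA = 0.9           # discount factor, matching the experiment-setup row above
N_ROUNDS = 5          # "few weak learners" / few boosting rounds (assumed value)
EPISODES_PER_ROUND = 20
KEEP_LAST = 3         # retain only the 3 most recent policies in the mixture


def rollout(env, policy, n_episodes):
    """Collect (state, action, discounted-return) tuples under `policy`."""
    states, actions, returns = [], [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        traj, done = [], False
        while not done:
            a = policy(obs)
            next_obs, r, terminated, truncated, _ = env.step(a)
            traj.append((obs, a, r))
            obs = next_obs
            done = terminated or truncated
        g = 0.0
        for s, a, r in reversed(traj):   # backward pass for discounted returns
            g = r + GAMMA * g
            states.append(s)
            actions.append(a)
            returns.append(g)
    return np.array(states), np.array(actions), np.array(returns)


def greedy_policy(models):
    """Greedy policy w.r.t. one fitted regressor per action."""
    def act(obs):
        scores = [m.predict(obs.reshape(1, -1))[0] for m in models]
        return int(np.argmax(scores))
    return act


env = gym.make("CartPole-v1")
n_actions = env.action_space.n
mixture = [lambda obs: env.action_space.sample()]   # start from a random policy

for _ in range(N_ROUNDS):
    # Sample trajectories from a uniformly random member of the current mixture.
    behavior = mixture[np.random.randint(len(mixture))]
    S, A, G = rollout(env, behavior, EPISODES_PER_ROUND)

    # Weak learner: one depth-limited tree per action, fit to observed returns.
    models = []
    for a in range(n_actions):
        mask = A == a
        tree = DecisionTreeRegressor(max_depth=3)
        if mask.any():
            tree.fit(S[mask], G[mask])
        else:
            tree.fit(S, np.zeros(len(S)))    # fallback when an action is unseen
        models.append(tree)

    mixture.append(greedy_policy(models))
    mixture = mixture[-KEEP_LAST:]           # keep the 3 most recent policies
```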
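The evaluation protocol quoted in the Dataset Splits row ("reward is computed over 100 episodes of interactions") can be sketched in the same assumed setting. The `evaluate` helper and the rule of sampling one mixture member per episode are assumptions, reusing `env` and `mixture` from the sketch above.

```python
# Illustrative evaluation sketch: average undiscounted episode reward of the
# final policy mixture over 100 episodes.
def evaluate(env, mixture, n_episodes=100):
    totals = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        policy = mixture[np.random.randint(len(mixture))]  # sample a mixture member
        total, done = 0.0, False
        while not done:
            obs, r, terminated, truncated, _ = env.step(policy(obs))
            total += r
            done = terminated or truncated
        totals.append(total)
    return float(np.mean(totals))

print("mean reward over 100 episodes:", evaluate(env, mixture))
```

Sampling a single mixture member at the start of each episode is one common way to execute a mixture policy; the paper's own trajectory-sampling procedure (Algorithm 3) may differ in detail.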