Oracle-Efficient Reinforcement Learning for Max Value Ensembles
Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Sengupta, Jessica Sorrell
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). |
| Researcher Affiliation | Academia | Marcel Hussing, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, mhussing@seas.upenn.edu; Michael Kearns, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, mkearns@cis.upenn.edu; Aaron Roth, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, aaroth@cis.upenn.edu; Sikata Bela Sengupta, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, sikata@seas.upenn.edu; Jessica Sorrell, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21218, jess@jhu.edu |
| Pseudocode | Yes | Algorithm 1 Max Iteration Mα(Πk) |
| Open Source Code | Yes | An implementation of our code is appended in the supplemental material. |
| Open Datasets | Yes | A recent robotic simulation benchmark called CompoSuite [Mendez et al., 2022] and its corresponding offline datasets [Hussing et al., 2024] offer an instantiation of such a scenario. The datasets we used are available in an open access repository at https://datadryad.org/stash/dataset/doi:10.5061/dryad.9cnp5hqps. |
| Dataset Splits | No | The paper mentions 'evaluation' episodes but does not specify train/validation/test dataset splits with percentages or sample counts for the datasets used. |
| Hardware Specification | Yes | Our experiments were conducted using a total of 17 GPUs including both server-grade (e.g., NVIDIA RTX A6000s) and consumer-grade (e.g., NVIDIA RTX 3090) GPUs. |
| Software Dependencies | No | The paper mentions software like "d3rlpy implementations" and "Adam Optimizer" but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | For practical purposes, we use a heuristic version of Max Iteration which does not re-compute the max-following policy at every step h but rather after multiple steps. Both algorithms are run for 10,000 steps initially (to initialize value functions for Max Iteration and to pre-fill the buffer for IQL) before doing updates and then for 50,000 steps for online training. Tables 1 and 2 detail the hyperparameters for Max Iteration and Implicit Q-Learning, respectively. |
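
The max-following rule quoted in the Research Type row is simple enough to state in code. Below is a minimal sketch of the max-following benchmark policy, assuming the constituent policies and per-policy value estimates are available as parallel lists of callables; the function and argument names are illustrative only, and this is the benchmark the paper's algorithm competes with rather than the authors' algorithm (which, per the paper, does not get access to the value functions).

```python
import numpy as np

def max_following_action(state, policies, value_estimates):
    """Sketch of the max-following rule: at each state, act according to
    whichever constituent policy currently has the highest estimated value.

    `policies` and `value_estimates` are assumed to be parallel lists of
    callables (policy: state -> action, value estimate: state -> float);
    these names are illustrative, not the authors' interface.
    """
    values = [v(state) for v in value_estimates]
    best = int(np.argmax(values))      # index of the highest-value constituent policy
    return policies[best](state)       # follow that policy's action at this state
```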
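The training schedule quoted in the Experiment Setup row can be summarized as a small loop skeleton. This is a sketch under stated assumptions: the warm-up and online step counts come from the paper, but the recompute interval and the callback names are hypothetical placeholders, since the paper only says the heuristic re-computes the max-following policy after multiple steps rather than at every step.

```python
WARMUP_STEPS = 10_000   # initialize value functions (Max Iteration) / pre-fill buffer (IQL)
ONLINE_STEPS = 50_000   # online training steps
RECOMPUTE_EVERY = 500   # assumed interval; the paper only says "after multiple steps"

def heuristic_max_iteration_schedule(step_env, update_values, recompute_max_policy):
    """Loop skeleton for the heuristic Max Iteration schedule; the three
    callbacks are hypothetical placeholders for environment interaction,
    value-function updates, and periodic re-computation of the
    max-following policy."""
    for t in range(WARMUP_STEPS + ONLINE_STEPS):
        step_env(t)                    # act in the environment with the current policy
        if t < WARMUP_STEPS:
            continue                   # no updates during the warm-up phase
        update_values(t)               # value-function updates once warm-up is done
        if t % RECOMPUTE_EVERY == 0:
            recompute_max_policy(t)    # heuristic: periodic, not at every step h
```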