Oracle-Efficient Reinforcement Learning for Max Value Ensembles
Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Sengupta, Jessica Sorrell
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). |
| Researcher Affiliation | Academia | Marcel Hussing, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, mhussing@seas.upenn.edu; Michael Kearns, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, mkearns@cis.upenn.edu; Aaron Roth, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, aaroth@cis.upenn.edu; Sikata Bela Sengupta, Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, sikata@seas.upenn.edu; Jessica Sorrell, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21218, jess@jhu.edu |
| Pseudocode | Yes | Algorithm 1 Max Iteration Mα(Πk) |
| Open Source Code | Yes | An implementation of our code is appended in the supplemental material. |
| Open Datasets | Yes | A recent robotic simulation benchmark called CompoSuite [Mendez et al., 2022] and its corresponding offline datasets [Hussing et al., 2024] offer an instantiation of such a scenario. The datasets we used are available in an open access repository at https://datadryad.org/stash/dataset/doi:10.5061/dryad.9cnp5hqps. |
| Dataset Splits | No | The paper mentions 'evaluation' episodes but does not specify train/validation/test dataset splits with percentages or sample counts for the datasets used. |
| Hardware Specification | Yes | Our experiments were conducted using a total of 17 GPUs including both server-grade (e.g., NVIDIA RTX A6000s) and consumer-grade (e.g., NVIDIA RTX 3090) GPUs. |
| Software Dependencies | No | The paper mentions software like "d3rlpy implementations" and "Adam Optimizer" but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | For practical purposes, we use a heuristic version of Max Iteration which does not re-compute the max-following policy at every step h but rather after multiple steps. Both algorithms are run for 10,000 steps initially (to initialize value functions for Max Iteration and to pre-fill the buffer for IQL) before doing updates and then for 50,000 steps for online training. Tables 1 and 2 detail the hyperparameters for Max Iteration and Implicit Q-Learning, respectively. |
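
The max-following rule quoted in the Research Type row is simple enough to state in code. Below is a minimal sketch of the max-following benchmark policy, assuming the constituent policies and per-policy value estimates are available as parallel lists of callables; the function and argument names are illustrative only, and this is the benchmark the paper's algorithm competes with rather than the authors' algorithm (which, per the paper, does not get access to the value functions).

```python
import numpy as np

def max_following_action(state, policies, value_estimates):
    """Sketch of the max-following rule: at each state, act according to
    whichever constituent policy currently has the highest estimated value.

    `policies` and `value_estimates` are assumed to be parallel lists of
    callables (policy: state -> action, value estimate: state -> float);
    these names are illustrative, not the authors' interface.
    """
    values = [v(state) for v in value_estimates]
    best = int(np.argmax(values))      # index of the highest-value constituent policy
    return policies[best](state)       # follow that policy's action at this state
```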
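The training schedule quoted in the Experiment Setup row can be summarized as a small loop skeleton. This is a sketch under stated assumptions: the warm-up and online step counts come from the paper, but the recompute interval and the callback names are hypothetical placeholders, since the paper only says the heuristic re-computes the max-following policy after multiple steps rather than at every step.

```python
WARMUP_STEPS = 10_000   # initialize value functions (Max Iteration) / pre-fill buffer (IQL)
ONLINE_STEPS = 50_000   # online training steps
RECOMPUTE_EVERY = 500   # assumed interval; the paper only says "after multiple steps"

def heuristic_max_iteration_schedule(step_env, update_values, recompute_max_policy):
    """Loop skeleton for the heuristic Max Iteration schedule; the three
    callbacks are hypothetical placeholders for environment interaction,
    value-function updates, and periodic re-computation of the
    max-following policy."""
    for t in range(WARMUP_STEPS + ONLINE_STEPS):
        step_env(t)                    # act in the environment with the current policy
        if t < WARMUP_STEPS:
            continue                   # no updates during the warm-up phase
        update_values(t)               # value-function updates once warm-up is done
        if t % RECOMPUTE_EVERY == 0:
            recompute_max_policy(t)    # heuristic: periodic, not at every step h
```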