Constrained episodic reinforcement learning in concave-convex and knapsack settings

Authors: Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, Wen Sun

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in constrained episodic benchmarks. |
| Researcher Affiliation | Collaboration | Kianté Brantley, University of Maryland, kdbrant@cs.umd.edu; Miroslav Dudík, Microsoft Research, mdudik@microsoft.com; Thodoris Lykouris, Microsoft Research, thlykour@microsoft.com; Sobhan Miryoosefi, Princeton University, miryoosefi@cs.princeton.edu; Max Simchowitz, UC Berkeley, msimchow@berkeley.edu; Aleksandrs Slivkins, Microsoft Research, slivkins@microsoft.com; Wen Sun, Cornell University, ws455@cornell.edu |
| Pseudocode | No | The paper describes the algorithms and their components (e.g., CONRL, CONPLANNER) and how to solve the optimization problems as linear programs, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/miryoosefi/ConRL |
| Open Datasets | Yes | We run our experiments on two grid-world environments: Mars rover (Tessler et al., 2019) and Box (Leike et al., 2017). |
| Dataset Splits | No | The paper describes running experiments on grid-world environments and training over a number of trajectories, but it does not specify traditional dataset splits (e.g., training, validation, and test percentages or counts) as commonly seen in supervised learning. |
| Hardware Specification | No | The paper does not provide specific hardware details, such as the GPU or CPU models used to run the experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as the programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | The episode horizon H is 30 and the agent's action is perturbed with probability 0.1 to a random action. APPROPO focuses on the feasibility problem, so it requires specifying a lower bound on the reward, which we set to 0.3 for Mars rover and 0.1 for Box. |
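For reference, a minimal sketch of how the setup parameters reported in the Experiment Setup row could be collected in a configuration dictionary. The key names (`horizon`, `action_noise`, `reward_lower_bound`) are illustrative assumptions and are not taken from the authors' repository.

```python
# Illustrative sketch only: parameter names are assumptions, not the authors'
# configuration format (see https://github.com/miryoosefi/ConRL for the real code).
EXPERIMENT_CONFIG = {
    "Mars rover": {
        "horizon": 30,              # episode horizon H reported in the paper
        "action_noise": 0.1,        # probability the agent's action is replaced by a random action
        "reward_lower_bound": 0.3,  # reward lower bound required by APPROPO for this environment
    },
    "Box": {
        "horizon": 30,
        "action_noise": 0.1,
        "reward_lower_bound": 0.1,
    },
}

if __name__ == "__main__":
    # Print the per-environment settings as a quick sanity check.
    for env_name, cfg in EXPERIMENT_CONFIG.items():
        print(env_name, cfg)
```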