Constrained episodic reinforcement learning in concave-convex and knapsack settings
Authors: Kianté Brantley, Miroslav Dudík, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, Wen Sun
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in constrained episodic benchmarks. |
| Researcher Affiliation | Collaboration | Kianté Brantley University of Maryland kdbrant@cs.umd.edu; Miroslav Dudík Microsoft Research mdudik@microsoft.com; Thodoris Lykouris Microsoft Research thlykour@microsoft.com; Sobhan Miryoosefi Princeton University miryoosefi@cs.princeton.edu; Max Simchowitz UC Berkeley msimchow@berkeley.edu; Aleksandrs Slivkins Microsoft Research slivkins@microsoft.com; Wen Sun Cornell University ws455@cornell.edu |
| Pseudocode | No | The paper describes its algorithms and their components (e.g., ConRL and ConPlanner) and explains how to solve the associated optimization problems as linear programs (see the LP sketch after the table), but it does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/miryoosefi/ConRL |
| Open Datasets | Yes | We run our experiments on two grid-world environments: Mars rover (Tessler et al., 2019) and Box (Leike et al., 2017). |
| Dataset Splits | No | The paper describes running experiments on grid-world environments and training over a number of trajectories, but it does not specify traditional dataset splits (e.g., training, validation, test percentages or counts) as commonly seen in supervised learning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | The episode horizon H is 30 and the agent's action is perturbed with probability 0.1 to a random action. APPROPO focuses on the feasibility problem, so it requires specifying a lower bound on the reward, which we set to 0.3 for Mars rover and 0.1 for Box. (See the rollout sketch after the table.) |
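For concreteness, here is what the linear-programming reduction mentioned in the Pseudocode row can look like. This is a minimal sketch, not the paper's ConPlanner: it poses a small synthetic constrained finite-horizon MDP as an LP over state-action occupancy measures, which is the standard formulation for constrained planning, and all quantities (`P`, `r`, `c`, `budget`) are placeholder data rather than anything from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny synthetic constrained MDP: S states, A actions, horizon H.
S, A, H = 3, 2, 4
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s']: transition probabilities
r = rng.uniform(size=(S, A))                  # per-step reward
c = rng.uniform(size=(S, A))                  # per-step resource consumption
mu0 = np.full(S, 1.0 / S)                     # initial state distribution
# Budget chosen so the problem is always feasible: the policy that picks the
# cheapest action in every state consumes at most max_s min_a c[s, a] per step.
budget = H * float(c.min(axis=1).max())

n = H * S * A                                 # one occupancy variable per (h, s, a)
idx = lambda h, s, a: (h * S + s) * A + a

A_eq, b_eq = [], []
for s in range(S):                            # step-0 occupancies match mu0
    row = np.zeros(n)
    row[[idx(0, s, a) for a in range(A)]] = 1.0
    A_eq.append(row); b_eq.append(mu0[s])
for h in range(H - 1):                        # flow conservation between steps
    for s2 in range(S):
        row = np.zeros(n)
        for a in range(A):
            row[idx(h + 1, s2, a)] = 1.0
        for s in range(S):
            for a in range(A):
                row[idx(h, s, a)] -= P[s, a, s2]
        A_eq.append(row); b_eq.append(0.0)

# linprog minimizes, so negate the reward objective; one inequality row
# bounds the expected total resource consumption by the budget.
obj = np.array([-r[s, a] for h in range(H) for s in range(S) for a in range(A)])
A_ub = np.array([[c[s, a] for h in range(H) for s in range(S) for a in range(A)]])

res = linprog(obj, A_ub=A_ub, b_ub=[budget], A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=(0, None), method="highs")
print("Optimal constrained return:", -res.fun)
```

The equality rows enforce flow conservation of the occupancy measure across steps; the single inequality row encodes a knapsack-style budget on expected total consumption, matching the "knapsack settings" of the title.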
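The Experiment Setup row reports a horizon of H = 30 and a 0.1 probability that the agent's action is perturbed to a random one. The sketch below shows one plausible rollout loop matching those numbers; `env`, `policy`, `actions`, and the Gym-style `reset()`/`step()` interface are assumptions for illustration, not the authors' code.

```python
import random

H = 30              # episode horizon reported in the paper's experiments
PERTURB_PROB = 0.1  # reported chance the chosen action is replaced by a random one

def run_episode(env, policy, actions):
    """Roll out one episode with the reported action-perturbation noise.

    `env` and `policy` are hypothetical stand-ins with a Gym-style
    reset()/step(action) interface; the paper does not specify this API.
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(H):
        action = policy(state)
        if random.random() < PERTURB_PROB:
            action = random.choice(actions)   # perturb to a uniformly random action
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```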