Batch Policy Learning under Constraints

Authors: Hoang Le, Cameron Voloshin, Yisong Yue

ICML 2019

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We validate our algorithm and analysis with two experimental settings."

Researcher Affiliation | Academia | "Hoang M. Le¹, Cameron Voloshin¹, Yisong Yue¹ (¹California Institute of Technology, Pasadena, CA). Correspondence to: Hoang M. Le <hmle@caltech.edu>."

Pseudocode | Yes | Algorithm 1 (Meta-algo for Batch Constrained Learning), Algorithm 2 (Constrained Batch Policy Learning), and Algorithm 3 (Fitted Q Evaluation: FQE(π, c)).

Open Source Code | No | The paper provides no explicit statement or link indicating that source code for the described methodology is publicly available.

Open Datasets | Yes | "Environment & Data Collection. The environment is an 8x8 grid. The agent has 4 actions N, S, E, W at each state. The main goal is to navigate from a starting position to the goal. Each episode terminates when the agent reaches the goal or falls into a hole. The main cost function is defined as c = 1 if goal is reached, otherwise c = 0 everywhere. We simulate a non-optimal data gathering policy πD by adding random sub-optimal actions to the shortest path policy from any given state to goal. We run πD for 5000 trajectories to collect the behavior dataset D (with constraint cost measurement specified below)."

Dataset Splits | No | The paper describes collecting a dataset D ("We run πD for 5000 trajectories to collect the behavior dataset D") and mentions "test-time performance", but gives no details on how this dataset is split into training, validation, and test sets (e.g., exact percentages or sample counts).

Hardware Specification | No | The paper gives no details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud instance types.

Software Dependencies | No | The paper names specific algorithms and models (e.g., DDQN, FQI, FQE, CNNs) but lists no software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).

Experiment Setup | No | The paper mentions general settings such as a "maximum horizon of 1000 for each episode" and setting "the threshold for each constraint to 75% of the DDQN benchmark", but does not report hyperparameters such as learning rates, batch sizes, optimizer settings, or detailed network architectures.
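Algorithm 3 in the paper's pseudocode is Fitted Q Evaluation, which repeatedly regresses Q(s, a) onto the one-step target c + γ·Q(s', π(s')) over the fixed batch. A minimal sketch of that recursion, using a tabular stand-in for the regression step (this is illustrative only, not the authors' function-approximation implementation; the toy dataset and all names are assumptions):

```python
def fitted_q_evaluation(D, pi, gamma=0.95, n_iters=50, n_states=2, n_actions=1):
    """Tabular FQE: estimate Q^pi of a fixed policy pi from batch data D.

    D: list of (s, a, c, s2, done) tuples with integer states and actions.
    pi: function mapping a state to the evaluated policy's action.
    Each iteration fits Q_{k+1}(s, a) to the targets c + gamma * Q_k(s2, pi(s2));
    in this tabular stand-in the "fit" is an exact per-(s, a) average.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        tot = [[0.0] * n_actions for _ in range(n_states)]
        cnt = [[0] * n_actions for _ in range(n_states)]
        for s, a, c, s2, done in D:
            y = c + (0.0 if done else gamma * Q[s2][pi(s2)])
            tot[s][a] += y
            cnt[s][a] += 1
        for s in range(n_states):
            for a in range(n_actions):
                if cnt[s][a]:                # leave unvisited cells untouched
                    Q[s][a] = tot[s][a] / cnt[s][a]
    return Q

# Toy check: two-state chain with cost 1 in state 0, then a terminal step of cost 0
D = [(0, 0, 1.0, 1, False), (1, 0, 0.0, 0, True)]
Q = fitted_q_evaluation(D, pi=lambda s: 0, gamma=0.9, n_iters=10)
# Q[0][0] = 1.0 (immediate cost 1 plus zero continuation value under pi)
```

The paper's version replaces the per-cell average with a supervised regression over a function class (e.g. a neural network), but the fixed-point being computed is the same.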
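The data-collection protocol quoted under Open Datasets (a shortest-path policy corrupted with random sub-optimal actions, run for 5000 trajectories on an 8x8 grid) can be sketched as below. The hole-free map, the ε mixing rate, and the per-episode horizon are assumptions for illustration, not values from the paper:

```python
import random
from collections import deque

N = 8                       # 8x8 grid; states are (row, col)
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
START, GOAL = (0, 0), (7, 7)
HOLES = set()               # illustrative: empty here; the real map has terminal holes

def neighbors(s):
    """Yield (action, next_state) pairs that stay on the grid."""
    for a, (dr, dc) in ACTIONS.items():
        r, c = s[0] + dr, s[1] + dc
        if 0 <= r < N and 0 <= c < N:
            yield a, (r, c)

def shortest_path_action(s):
    """BFS from s to GOAL; return the first action on a shortest path."""
    queue, parent = deque([s]), {s: None}
    while queue:
        cur = queue.popleft()
        if cur == GOAL:
            break
        for a, nxt in neighbors(cur):
            if nxt not in parent and nxt not in HOLES:
                parent[nxt] = (cur, a)
                queue.append(nxt)
    node, act = GOAL, None    # walk back from GOAL to recover the action at s
    while parent[node] is not None:
        node, act = parent[node]
    return act

def collect(num_trajectories=5000, eps=0.3, horizon=200, seed=0):
    """Behavior dataset D of (s, a, s2, done) tuples from a sub-optimal policy."""
    rng = random.Random(seed)
    D = []
    for _ in range(num_trajectories):
        s = START
        for _ in range(horizon):
            # with prob. eps take a random action, otherwise the shortest-path one
            a = rng.choice(list(ACTIONS)) if rng.random() < eps \
                else shortest_path_action(s)
            moves = dict(neighbors(s))
            s2 = moves.get(a, s)          # bumping a wall keeps the agent in place
            done = s2 == GOAL or s2 in HOLES
            D.append((s, a, s2, done))
            if done:
                break
            s = s2
    return D

D = collect(num_trajectories=10)          # small run for illustration
```

Re-running BFS at every step is wasteful (a single pass from GOAL would give the optimal action for all states at once), but it keeps the sketch short; costs and constraint-cost measurements would be attached to each tuple exactly as the quoted passage describes.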