Batch Policy Learning under Constraints
Authors: Hoang Le, Cameron Voloshin, Yisong Yue
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We validate our algorithm and analysis with two experimental settings. |
| Researcher Affiliation | Academia | Hoang M. Le 1 Cameron Voloshin 1 Yisong Yue 1 1California Institute of Technology, Pasadena, CA. Correspondence to: Hoang M. Le <hmle@caltech.edu>. |
| Pseudocode | Yes | Algorithm 1 Meta-algo for Batch Constrained Learning, Algorithm 2 Constrained Batch Policy Learning, Algorithm 3 Fitted Q Evaluation: FQE(π, c) (hedged Python sketches of Algorithms 3 and 1 appear after the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Environment & Data Collection. The environment is an 8x8 grid. The agent has 4 actions N,S,E,W at each state. The main goal is to navigate from a starting position to the goal. Each episode terminates when the agent reaches the goal or falls into a hole. The main cost function is defined as c = −1 if goal is reached, otherwise c = 0 everywhere. We simulate a non-optimal data gathering policy πD by adding random sub-optimal actions to the shortest path policy from any given state to goal. We run πD for 5000 trajectories to collect the behavior dataset D (with constraint cost measurement specified below). (A toy data-collection sketch appears after the table.) |
| Dataset Splits | No | The paper describes collecting a dataset D (e.g., 'We run πD for 5000 trajectories to collect the behavior dataset D'), and mentions 'test-time performance', but does not provide specific details on how this dataset is split into training, validation, and test sets (e.g., exact percentages or sample counts). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions specific algorithms and models (e.g., DDQN, FQI, FQE, CNNs) but does not list any specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | No | The paper mentions general settings such as 'maximum horizon of 1000 for each episode' and 'set the threshold for each constraint to 75% of the DDQN benchmark', but does not provide specific hyperparameters like learning rates, batch sizes, optimizer details, or detailed neural network architectures. |
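
The pseudocode row above names Fitted Q Evaluation as Algorithm 3. To connect that name to executable code, here is a minimal tabular sketch of FQE: given a fixed batch dataset and a policy π to evaluate, it repeatedly regresses Q onto one-step Bellman targets. The tabular representation, the averaged targets in place of a learned regressor, and the `gamma`/`n_iters` values are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def fqe(dataset, policy, n_states, n_actions, gamma=0.95, n_iters=100):
    """Estimate Q^pi from batch data by repeatedly regressing
    Q_{k+1}(s, a) onto the target c + gamma * Q_k(s', pi(s'))."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, c, s_next, done in dataset:
            bootstrap = 0.0 if done else gamma * q[s_next, policy[s_next]]
            targets[s, a] += c + bootstrap
            counts[s, a] += 1
        # Tabular stand-in for the paper's regression step: average the
        # Bellman targets observed for each (state, action) pair.
        mask = counts > 0
        q[mask] = targets[mask] / counts[mask]
    return q  # Q[s, policy[s]] estimates the policy's cost-to-go
```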
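Algorithm 1 is described in the paper as a game over the Lagrangian of the constrained problem, between a policy player and a λ-player. A hedged sketch of that loop follows; the `best_response` and `evaluate` callables, the projected-gradient update, and the learning rate are illustrative placeholders rather than the paper's exact online-learning procedure (in the paper, the policy player runs FQI on the Lagrangian cost and the constraint estimates come from FQE).

```python
import numpy as np

def constrained_batch_learning(best_response, evaluate, thresholds,
                               n_rounds=50, lr=0.1):
    """Game loop: the policy player best-responds to the current
    multipliers; the lambda player moves against the observed
    constraint violations (estimated off-policy, e.g. with FQE)."""
    thresholds = np.asarray(thresholds, dtype=float)
    lmbda = np.zeros_like(thresholds)
    policies = []
    for _ in range(n_rounds):
        pi = best_response(lmbda)         # e.g. FQI on cost c + lmbda . g
        policies.append(pi)
        g_hat = np.asarray(evaluate(pi))  # estimated constraint costs g(pi)
        # Projected gradient ascent on the Lagrangian in lambda
        # (a stand-in for the paper's online-learning update).
        lmbda = np.maximum(0.0, lmbda + lr * (g_hat - thresholds))
    # A mixture over the per-round policies approximates the
    # equilibrium policy of the Lagrangian game.
    return policies, lmbda
```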
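Finally, to make the Open Datasets row concrete, here is a toy sketch of rolling out the sub-optimal behavior policy πD on an 8x8 grid for 5000 episodes. The ε-mixing rate, the greedy Manhattan-distance stand-in for the shortest-path policy, the step cap, and the omission of holes are all assumptions; only the grid size, action set, episode count, and cost convention come from the quoted text.

```python
import random

GRID, GOAL = 8, (7, 7)
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def shortest_path_action(pos):
    # Greedy Manhattan-distance step toward the goal; a stand-in for
    # the paper's shortest-path policy (hole locations are omitted).
    r, c = pos
    return "S" if r < GOAL[0] else "E"

def behavior_policy(pos, eps=0.3):
    # pi_D: shortest-path policy mixed with random sub-optimal actions.
    if random.random() < eps:
        return random.choice(list(ACTIONS))
    return shortest_path_action(pos)

def collect(n_episodes=5000, max_steps=200):
    dataset = []
    for _ in range(n_episodes):
        pos = (0, 0)
        for _ in range(max_steps):
            a = behavior_policy(pos)
            dr, dc = ACTIONS[a]
            nxt = (min(max(pos[0] + dr, 0), GRID - 1),
                   min(max(pos[1] + dc, 0), GRID - 1))
            done = nxt == GOAL
            cost = -1.0 if done else 0.0  # cost convention quoted above
            dataset.append((pos, a, cost, nxt, done))
            if done:
                break
            pos = nxt
    return dataset
```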