Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Batch Policy Learning under Constraints
Authors: Hoang Le, Cameron Voloshin, Yisong Yue
ICML 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We validate our algorithm and analysis with two experimental settings. |
| Researcher Affiliation | Academia | Hoang M. Le¹, Cameron Voloshin¹, Yisong Yue¹. ¹California Institute of Technology, Pasadena, CA. Correspondence to: Hoang M. Le <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Meta-algo for Batch Constrained Learning, Algorithm 2 Constrained Batch Policy Learning, Algorithm 3 Fitted Q Evaluation: FQE(π, c) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Environment & Data Collection. The environment is an 8x8 grid. The agent has 4 actions N,S,E,W at each state. The main goal is to navigate from a starting position to the goal. Each episode terminates when the agent reaches the goal or falls into a hole. The main cost function is defined as c = 1 if goal is reached, otherwise c = 0 everywhere. We simulate a non-optimal data gathering policy πD by adding random sub-optimal actions to the shortest path policy from any given state to goal. We run πD for 5000 trajectories to collect the behavior dataset D (with constraint cost measurement specified below). |
| Dataset Splits | No | The paper describes collecting a dataset D (e.g., 'We run πD for 5000 trajectories to collect the behavior dataset D'), and mentions 'test-time performance', but does not provide specific details on how this dataset is split into training, validation, and test sets (e.g., exact percentages or sample counts). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions specific algorithms and models (e.g., DDQN, FQI, FQE, CNNs) but does not list any specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | No | The paper mentions general settings such as 'maximum horizon of 1000 for each episode' and 'set the threshold for each constraint to 75% of the DDQN benchmark', but does not provide specific hyperparameters like learning rates, batch sizes, optimizer details, or detailed neural network architectures. |
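
The data collection quoted in the Open Datasets row (an 8x8 grid, four actions, a shortest-path policy perturbed with random actions, 5000 trajectories) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the hole layout, start and goal cells, the perturbation probability `epsilon`, the step cap `max_steps`, and all function names are assumptions introduced here. The main cost follows the quoted text (`goal_cost` at the goal, 0 elsewhere), and the constraint cost is omitted because the excerpt does not define it.

```python
import random
from collections import deque

SIZE = 8
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
START, GOAL = (0, 0), (7, 7)                       # assumed positions
HOLES = {(2, 3), (3, 5), (4, 3), (5, 1), (6, 6)}   # hypothetical hole layout


def step(state, action):
    """Move one cell; moves that would leave the grid keep the agent in place."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    return (row, col)


def distances_to_goal():
    """Breadth-first search outward from the goal, never passing through holes."""
    dist = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        cell = queue.popleft()
        for action in ACTIONS:
            nbr = step(cell, action)
            if nbr not in dist and nbr not in HOLES:
                dist[nbr] = dist[cell] + 1
                queue.append(nbr)
    return dist


def shortest_path_action(state, dist):
    """Pick the action whose successor is closest to the goal (holes count as infinitely far)."""
    return min(ACTIONS, key=lambda a: dist.get(step(state, a), float("inf")))


def collect(num_trajectories=5000, epsilon=0.3, max_steps=200, goal_cost=1.0, seed=0):
    """Roll out the perturbed shortest-path policy and record transitions.

    epsilon, max_steps, and seed are illustrative assumptions; goal_cost
    mirrors the main cost as quoted in the table row above.
    """
    rng = random.Random(seed)
    dist = distances_to_goal()
    dataset = []
    for _ in range(num_trajectories):
        state = START
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))          # random sub-optimal action
            else:
                action = shortest_path_action(state, dist)  # shortest-path action
            nxt = step(state, action)
            done = nxt == GOAL or nxt in HOLES              # episode ends at goal or hole
            cost = goal_cost if nxt == GOAL else 0.0
            dataset.append((state, action, cost, nxt, done))
            if done:
                break
            state = nxt
    return dataset
```

Calling `collect()` returns a flat list of `(state, action, cost, next_state, done)` tuples. A batch method would consume such a list without further interaction with the environment, which is why the absence of a released dataset or generation script matters for reproducibility.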