Scalable Online Exploration via Coverability
Authors: Philip Amortila, Dylan J. Foster, Akshay Krishnamurthy
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that L1-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space. We present proof-of-concept experiments to validate our theoretical results. |
| Researcher Affiliation | Collaboration | 1University of Illinois, Urbana-Champaign. 2Microsoft Research. |
| Pseudocode | Yes | Algorithm 1: Approximate Policy Cover Computation via L1-Coverability Relaxation. Algorithm 2: Coverage-Driven Exploration (CODEX). |
| Open Source Code | Yes | Code available at github.com/philip-amortila/l1-coverability. |
| Open Datasets | Yes | We focus on the planning problem (Section 4), and consider the classical Mountain Car environment (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes the environment setup and data generation process (e.g., 'deterministic starting state,' 'discretization'), but it does not specify explicit train/validation/test dataset splits or percentages for a pre-existing dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions software like 'PyTorch (Paszke et al., 2019)', 'Adam optimizer (Kingma & Ba, 2015)', and 'OpenAI Gym (Brockman et al., 2016)', but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We take a discount factor of 0.99 and a variance smoothing parameter of σ = 0.05. We train REINFORCE with horizons of length 400. We take πₜ, the policy which approximates Line 4 of Algorithm 1, to be the policy returned after 1000 REINFORCE updates, with one update after each rollout. The updates in REINFORCE use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10⁻³. We estimate all occupancies with N = 100 rollouts of length H = 200. We train for 20 epochs, corresponding to T = 20 in the loop of Line 3 of Algorithm 1. For the regularized reward of Eq. (16), we take ε = 10⁻⁴. |
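
To make the Experiment Setup row easier to reuse, the following Python sketch collects the reported hyperparameters into a single configuration object and sets up the Mountain Car environment named in the Open Datasets row. This is a minimal sketch, not the authors' code: the class name `CodexConfig`, the field names, and the `MountainCar-v0` Gym ID are illustrative assumptions; only the numeric values come from the paper's description.

```python
# Hedged sketch of the reported experiment configuration for the Mountain Car runs.
# Numeric values are taken from the paper; all names and the Gym ID are assumptions.
from dataclasses import dataclass

import gym  # OpenAI Gym (Brockman et al., 2016); version not specified in the paper


@dataclass
class CodexConfig:
    discount: float = 0.99          # discount factor
    sigma: float = 0.05             # variance smoothing parameter
    reinforce_horizon: int = 400    # rollout length used when training REINFORCE
    reinforce_updates: int = 1000   # one Adam update after each rollout
    learning_rate: float = 1e-3     # Adam learning rate (Kingma & Ba, 2015)
    occupancy_rollouts: int = 100   # N, rollouts used to estimate occupancies
    occupancy_horizon: int = 200    # H, length of each occupancy-estimation rollout
    epochs: int = 20                # T, iterations of the loop in Line 3 of Algorithm 1
    reward_epsilon: float = 1e-4    # ε in the regularized reward of Eq. (16)


config = CodexConfig()
env = gym.make("MountainCar-v0")  # assumed Gym ID for the classical Mountain Car task
```

Grouping the values this way makes explicit which parameters belong to the inner REINFORCE loop versus the outer loop of Algorithm 1, which is the distinction the quoted setup text is drawing.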