Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Dynamic Subgoal-based Exploration via Bayesian Optimization
Authors: Yijia Wang, Matthias Poloczek, Daniel R. Jiang
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains. We now show numerical experiments to demonstrate the cost-effectiveness of the BESD framework. |
| Researcher Affiliation | Collaboration | Yijia Wang (University of Pittsburgh); Matthias Poloczek (Amazon); Daniel R. Jiang (Meta AI, University of Pittsburgh) |
| Pseudocode | Yes | Algorithm 1: Bayesian Exploratory Subgoal Design. 1. Set $n = 0$. Estimate hyperparameters of the GP prior $f$ using initial samples. 2. Compute next decision $(\theta^n, \tau^n, q^n)$ according to the acquisition function (7). 3. Train in environment $\xi^{n+1}$ augmented with $\theta^n$ ($M_{\xi^{n+1},\theta^n}$) using levers $(\tau^n, q^n)$. 4. Observe $y^{n+1}(\theta^n, \tau^n)$ and update posterior on $f$. 5. If $n < N$, increment $n$ and return to Step 2. 6. Return a subgoal recommendation $\theta^N_{\text{rec}}$ that maximizes $\mu^N(\theta, \tau_{\max})$. |
| Open Source Code | Yes | BESD is implemented using the MOE package (Clark et al., 2014) and the full source code can be found at the following URL: https://github.com/yjwang0618/subgoal-based-exploration. |
| Open Datasets | No | The first set of environments (GW10) is a distribution over 10 × 10 gridworlds... The second domain (GW20) is a distribution of larger 20 × 20 gridworlds... The third domain (TR) is a distribution of 10 × 10 gridworlds... The mountain car (MC) domain, as we introduced in Example 2, is a commonly used RL benchmark environment... In domains KEY2 (with two subgoals) and KEY3 (with three subgoals), we consider a 10 × 10 gridworld... |
| Dataset Splits | Yes | In our setup, an agent is given a fixed (and small) number of opportunities to train in environments randomly drawn from a distribution Ξ (henceforth, we refer to these as training environments)... After these opportunities are exhausted, the agent enters a random test environment ξ ∼ Ξ... For each replication, to assess the performance at a particular point in the process, we take its latest recommendation and test it by averaging its performance on a random sample of 200 test MDPs (i.e., ξN). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or other computer specifications used for running the experiments. It only mentions general computational context like 'RL training itself did not use prohibitive amounts of computation'. |
| Software Dependencies | No | BESD is implemented using the MOE package (Clark et al., 2014)... The underlying RL algorithm for all environments is Q-learning (Watkins & Dayan, 1992)... Both EI and LCB are implemented using the GPyOpt package (González, 2016). |
| Experiment Setup | Yes | The potential function at state $s$ with the $j$th subgoal activated is $\Phi_j(s) = w_1 \exp[-0.5\,(s - \theta_j)^2 / w_2]$, where the height is $w_1 = 0.2$ and width is $w_2 = 10$. The underlying RL algorithm for all environments is Q-learning with an ϵ-greedy behavioral policy (with ϵ = 0.2). We use T = {200, 600, 1000} for the possible values of τ and Q = {5, 20} for the possible values of q [for GW10]. In this experiment, we consider the case of only allowing BESD to select the maximum episode length from T = {4000, 7000, 10000}, while keeping q = 20 fixed [for GW20]. The discount factor is set to γ = 0.98 [for TR]. Setting η = 3 (the default value) and R = 81, HB consists of $\log_\eta R$ rounds. |
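
To make the quoted experiment setup concrete, the following is a minimal sketch of how the reported values (ϵ = 0.2, γ = 0.98, η = 3, R = 81) might enter a tabular Q-learning loop and a Hyperband round count. It is not the authors' code: the function names, the list-of-lists Q-table, and the learning rate `ALPHA` are assumptions made for illustration, and the table does not state a learning rate.

```python
import random

EPSILON = 0.2  # exploration rate quoted in the Experiment Setup row
GAMMA = 0.98   # discount factor quoted for the TR domain
ALPHA = 0.1    # learning rate: an assumption, not stated in the table


def epsilon_greedy(q_row, epsilon=EPSILON, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q, s, a, reward, s_next, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


def hyperband_rounds(eta=3, max_resource=81):
    """Return log_eta(R) for R an exact power of eta, using integer division only."""
    rounds, r = 0, max_resource
    while r >= eta:
        r //= eta
        rounds += 1
    return rounds
```

With η = 3 and R = 81, `hyperband_rounds()` evaluates to 4, which is the value of log_η R in the quoted sentence. The integer loop is used instead of `math.log` to avoid floating-point rounding on exact powers.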