Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Epistemic Monte Carlo Tree Search

Authors: Yaniv Oren, Viliam Vadocz, Matthijs T. J. Spaan, Wendelin Boehmer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate EMCTS in the challenging, sparse-reward task of programming in subleq, which resembles real-world applications, as well as in the commonly used hard-exploration benchmark Deep Sea (Osband et al., 2020) (Section 5.2). Our method finds correct programs for a harder programming task in far fewer samples than the AZ baseline. In the Deep Sea benchmark, our method demonstrates deep exploration by solving both the stochastic and deterministic reward variations of the task, neither of which the baseline A/MZ can solve in a reasonable number of samples. In addition, EMCTS significantly outperforms an otherwise equivalent ablation that does not rely on search for epistemic uncertainty estimation, demonstrating significant advantages of search for uncertainty estimation."
Researcher Affiliation | Academia | "Yaniv Oren, Delft University of Technology, 2628 CD Delft, The Netherlands, EMAIL"
Pseudocode | Yes | "See Algorithm 1 for pseudocode of EMCTS with EUCT, where we suppress dependence on the model M̂ for notational simplicity. Extensions to MCTS are marked in blue."
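The quoted Algorithm 1 is not reproduced in this report. As a rough illustration only, the core idea of an "epistemic" UCT rule can be sketched as a standard UCB1-style selection score augmented with an uncertainty bonus. Everything here is an assumption for illustration: the function names, the bonus form, and the default weights `c` and `beta` are hypothetical and are not taken from the paper.

```python
import math

def euct_score(q, n_parent, n_child, sigma, c=1.25, beta=1.0):
    """Illustrative UCT score with an added epistemic-uncertainty bonus.

    q     -- estimated value of the child node
    sigma -- epistemic (model) uncertainty estimate for the child
    c, beta -- exploration weights (hypothetical defaults, not from the paper)
    """
    # Standard visit-count exploration term (UCB1-style), plus an
    # explicit bonus for children whose value estimate is uncertain.
    visit_bonus = c * math.sqrt(math.log(n_parent + 1) / (n_child + 1))
    return q + visit_bonus + beta * sigma

def select_child(children):
    """Pick the child maximizing the epistemically augmented score.

    `children` is a list of dicts with keys "q", "n", and "sigma"
    (an illustrative data layout, not the authors' node structure).
    """
    n_parent = sum(ch["n"] for ch in children)
    return max(
        children,
        key=lambda ch: euct_score(ch["q"], n_parent, ch["n"], ch["sigma"]),
    )
```

With equal values and visit counts, the rule prefers the child with higher epistemic uncertainty, which is the behavior the paper's deep-exploration results rely on. For the authors' actual selection rule, see Algorithm 1 in the paper.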
Open Source Code | Yes | "Our implementation, inspired by sic-1, an open-source game demonstrating subleq, is available at https://github.com/emcts/e-alphazero."
Open Datasets | Yes | "We evaluate EMCTS in the challenging, sparse-reward task of programming in subleq, which resembles real-world applications, as well as in the commonly used hard-exploration benchmark Deep Sea (Osband et al., 2020)."
Dataset Splits | Yes | "In practice, rather than alternating between exploration and exploitation episodes, we run a certain number of episodes in parallel, a portion of which are exploitative and the rest exploratory. In our experiments the ratio was 50/50. ... Table 3: Hyperparameters used in the Deep Sea experiments ... Evaluation episodes 8."
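The 50/50 partition of parallel episodes described in the quote can be sketched as follows. The function name, signature, and index-based partition are illustrative assumptions, not the authors' code.

```python
def split_episodes(num_parallel, exploratory_fraction=0.5):
    """Partition parallel episode slots into exploratory and exploitative sets.

    Mirrors the 50/50 ratio described in the paper; this helper is a
    hypothetical sketch, not the authors' implementation.
    """
    n_explore = int(num_parallel * exploratory_fraction)
    explore_ids = list(range(n_explore))          # slots run with exploration
    exploit_ids = list(range(n_explore, num_parallel))  # slots run greedily
    return explore_ids, exploit_ids
```

For example, `split_episodes(8)` assigns four slots to exploration and four to exploitation, matching the reported ratio.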
Hardware Specification | No | "We acknowledge the use of computational resources of the Delft Blue supercomputer, provided by the Delft High Performance Computing Centre (https://www.tudelft.nl/dhpc), as well as the DAIC cluster."
Software Dependencies | No | "A parallelized implementation in JAX (Bradbury et al., 2018) of EMCTS, paired with an AZ agent and an environment implementing the assembly language subleq (Mazonka & Kolodin, 2011)."
Experiment Setup | Yes | "E Network Architecture & Hyperparameters; E.1 Hyperparameter Search; E.2 Network Architecture; E.3 Deep Sea Hyperparameter Configuration; E.4 SUBLEQ Hyperparameter Configuration ... Table 3: Hyperparameters used in the Deep Sea experiments:
  Stacked Observations: 1
  γ: 0.995
  Number of simulations in MCTS: 50
  Dirichlet noise ratio (ξ): 0.3
  Root exploration fraction: 0
  Batch size: 256
  Learning rate: 0.0005
  Optimizer: Adam (Kingma & Ba, 2015)
  Unroll steps l: 5
  Value target TD steps (n_v): 5
  UBE target TD steps (n_u): 1
  Value support size: 21
  UBE support size: 21
  Reward support size: 21
  Reanalyzed policy ratio: 0.99 (see Ye et al., 2021)"
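For readers wanting to reproduce the Deep Sea run, the hyperparameters quoted above can be transcribed into a config mapping. The key names and the dict layout below are hypothetical conventions for illustration; only the values come from the quoted Table 3.

```python
# Hypothetical transcription of Table 3 (Deep Sea experiments).
# Key names are illustrative, not the authors' configuration schema.
DEEP_SEA_HPARAMS = {
    "stacked_observations": 1,
    "gamma": 0.995,                 # discount γ
    "num_simulations": 50,          # MCTS simulations per move
    "dirichlet_noise_ratio": 0.3,   # ξ
    "root_exploration_fraction": 0.0,
    "batch_size": 256,
    "learning_rate": 0.0005,
    "optimizer": "adam",            # Adam (Kingma & Ba, 2015)
    "unroll_steps": 5,              # l
    "value_target_td_steps": 5,     # n_v
    "ube_target_td_steps": 1,       # n_u
    "value_support_size": 21,
    "ube_support_size": 21,
    "reward_support_size": 21,
    "reanalyzed_policy_ratio": 0.99,  # see (Ye et al., 2021)
}
```

Keeping such values in a single mapping makes it straightforward to diff the Deep Sea configuration against the SUBLEQ configuration in Appendix E.4.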