Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Branching Reinforcement Learning
Authors: Yihan Du, Wei Chen
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments for branching RL. We set $K = 5000$, $\delta = 0.005$, $H = 6$, $m = 2$, $N \in \{10, 15\}$, $S = \{s_{\perp}, s_1, \dots, s_5\}$. $A$ is the collection of all $m$-cardinality subsets of $A^{\text{univ}} = \{a_1, \dots, a_N\}$, and thus $\lvert A \rvert = \binom{N}{m} \in \{45, 105\}$. The reward function $r(s, a) = 1$ for any $(s, a) \in S \times A$. The trigger probability $q(s, a) = \frac{1}{m}$ for any $(s, a) \in S \times \{a_{N-1}, a_N\}$, and $q(s, a) = \frac{1}{2m}$ for any $(s, a) \in S \times (A^{\text{univ}} \setminus \{a_{N-1}, a_N\})$. We set $s_1$ as the initial state for each episode. Under all actions $a \in A^{\text{univ}}$, the transition probability $q(s' \mid s_1, a) = 0.5$ for any $s' \in \{s_2, s_3\}$, and $q(s' \mid s, a) = 0.5$ for any $(s, s') \in \{s_2, s_3\} \times \{s_4, s_5\}$ or $(s, s') \in \{s_4, s_5\} \times \{s_2, s_3\}$. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs. |
| Researcher Affiliation | Collaboration | (1) IIIS, Tsinghua University, Beijing, China; (2) Microsoft Research. Correspondence to: Yihan Du <EMAIL>, Wei Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: BranchVI; Algorithm 2: BranchRFE |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper describes a constructed problem instance with specific parameters (H, m, N, S, reward function, trigger probability, transition probability) for its experiments, but it does not use a publicly available or open dataset. No concrete access information for a dataset is provided. |
| Dataset Splits | No | The paper defines the parameters of its constructed problem instance for the experiments, but it does not specify training, test, or validation dataset splits. The problem is a simulation within a defined environment, not a split of an existing dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers for reproducing the experiments. |
| Experiment Setup | Yes | We set $K = 5000$, $\delta = 0.005$, $H = 6$, $m = 2$, $N \in \{10, 15\}$, $S = \{s_{\perp}, s_1, \dots, s_5\}$. $A$ is the collection of all $m$-cardinality subsets of $A^{\text{univ}} = \{a_1, \dots, a_N\}$, and thus $\lvert A \rvert = \binom{N}{m} \in \{45, 105\}$. The reward function $r(s, a) = 1$ for any $(s, a) \in S \times A$. The trigger probability $q(s, a) = \frac{1}{m}$ for any $(s, a) \in S \times \{a_{N-1}, a_N\}$, and $q(s, a) = \frac{1}{2m}$ for any $(s, a) \in S \times (A^{\text{univ}} \setminus \{a_{N-1}, a_N\})$. We set $s_1$ as the initial state for each episode. Under all actions $a \in A^{\text{univ}}$, the transition probability $q(s' \mid s_1, a) = 0.5$ for any $s' \in \{s_2, s_3\}$, and $q(s' \mid s, a) = 0.5$ for any $(s, s') \in \{s_2, s_3\} \times \{s_4, s_5\}$ or $(s, s') \in \{s_4, s_5\} \times \{s_2, s_3\}$. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs. |
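
Since the paper releases no code, the constructed instance quoted in the Experiment Setup row can only be rebuilt from its description. The following is a minimal Python sketch of that instance, not the authors' implementation. The function name `build_instance`, the string labels for states and actions, and the dictionary layout are all hypothetical. Two further assumptions: the reward is stored per (state, base action) pair, and the extra state listed before s1 is left out because the quote does not describe its dynamics.

```python
from itertools import combinations
from math import comb


def build_instance(N, m=2):
    """Rebuild the quoted problem instance as plain dictionaries (illustrative only)."""
    base_actions = [f"a{i}" for i in range(1, N + 1)]      # universal action set a1..aN
    super_actions = list(combinations(base_actions, m))    # all m-cardinality subsets
    states = [f"s{i}" for i in range(1, 6)]                 # s1..s5 (extra state omitted)

    # Trigger probability: 1/m for the last two base actions, 1/(2m) for the rest.
    favoured = set(base_actions[-2:])                       # {a_{N-1}, a_N}
    trigger = {
        (s, a): (1 / m if a in favoured else 1 / (2 * m))
        for s in states
        for a in base_actions
    }

    # Reward is 1 everywhere (assumed here to be indexed by base action).
    reward = {(s, a): 1.0 for s in states for a in base_actions}

    # Transitions are the same under every base action:
    # s1 -> {s2, s3}, {s2, s3} -> {s4, s5}, {s4, s5} -> {s2, s3}, each with probability 0.5.
    successors = {
        "s1": ["s2", "s3"],
        "s2": ["s4", "s5"],
        "s3": ["s4", "s5"],
        "s4": ["s2", "s3"],
        "s5": ["s2", "s3"],
    }
    transition = {
        (s, a): {nxt: 0.5 for nxt in successors[s]}
        for s in states
        for a in base_actions
    }
    return base_actions, super_actions, trigger, reward, transition


if __name__ == "__main__":
    for N in (10, 15):
        _, super_actions, _, _, _ = build_instance(N)
        # The number of super actions should match the binomial coefficient C(N, m).
        print(N, len(super_actions), comb(N, 2))
```

Enumerating the subsets with `itertools.combinations` gives a direct check on the quoted action-set sizes: for m = 2, N = 10 yields 45 subsets and N = 15 yields 105.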