Branching Reinforcement Learning

Authors: Yihan Du, Wei Chen

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments for branching RL. We set K = 5000, δ = 0.005, H = 6, m = 2, N ∈ {10, 15}, S = {s_⊥, s_1, . . . , s_5}. A is the collection of all m-cardinality subsets of A_univ = {a_1, . . . , a_N}, and thus |A| = C(N, m) ∈ {45, 105}. The reward function r(s, a) = 1 for any (s, a) ∈ S × A. The trigger probability q(s, a) = 1/m for any (s, a) ∈ S × {a_{N-1}, a_N}, and q(s, a) = 1/(2m) for any (s, a) ∈ S × (A_univ \ {a_{N-1}, a_N}). We set s_1 as the initial state for each episode. Under all actions a ∈ A_univ, the transition probability p(s' | s_1, a) = 0.5 for any s' ∈ {s_2, s_3}, and p(s' | s, a) = 0.5 for any (s, s') ∈ {s_2, s_3} × {s_4, s_5} or (s, s') ∈ {s_4, s_5} × {s_2, s_3}. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs.
Researcher Affiliation | Collaboration | IIIS, Tsinghua University, Beijing, China; Microsoft Research. Correspondence to: Yihan Du <duyh18@mails.tsinghua.edu.cn>, Wei Chen <weic@microsoft.com>.
Pseudocode | Yes | Algorithm 1: BranchVI; Algorithm 2: BranchRFE
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available.
Open Datasets | No | The paper describes a constructed problem instance with specific parameters (H, m, N, S, reward function, trigger probability, transition probability) for its experiments, but it does not use a publicly available or open dataset. No concrete access information for a dataset is provided.
Dataset Splits | No | The paper defines the parameters of its constructed problem instance for the experiments, but it does not specify training, test, or validation dataset splits. The problem is a simulation within a defined environment, not a split of an existing dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers for reproducing the experiments.
Experiment Setup | Yes | We set K = 5000, δ = 0.005, H = 6, m = 2, N ∈ {10, 15}, S = {s_⊥, s_1, . . . , s_5}. A is the collection of all m-cardinality subsets of A_univ = {a_1, . . . , a_N}, and thus |A| = C(N, m) ∈ {45, 105}. The reward function r(s, a) = 1 for any (s, a) ∈ S × A. The trigger probability q(s, a) = 1/m for any (s, a) ∈ S × {a_{N-1}, a_N}, and q(s, a) = 1/(2m) for any (s, a) ∈ S × (A_univ \ {a_{N-1}, a_N}). We set s_1 as the initial state for each episode. Under all actions a ∈ A_univ, the transition probability p(s' | s_1, a) = 0.5 for any s' ∈ {s_2, s_3}, and p(s' | s, a) = 0.5 for any (s, s') ∈ {s_2, s_3} × {s_4, s_5} or (s, s') ∈ {s_4, s_5} × {s_2, s_3}. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs.
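
Since no code is released, the Experiment Setup row above is the only specification of the environment. Below is a minimal Python sketch of that constructed instance, assuming the reconstructed parameters (δ = 0.005, trigger probabilities 1/m and 1/(2m), and an absorbing state for branches that fail to trigger). The class name BranchingInstance, the method names, and the absorbing-state label "s_bot" are illustrative choices of ours, not the authors' code.

```python
# Sketch of the simulated branching-RL instance described in the Experiment Setup row.
# Names (BranchingInstance, step helpers, "s_bot") are illustrative; the paper's code is not public.
from itertools import combinations
import random


class BranchingInstance:
    def __init__(self, N=10, m=2, H=6, K=5000, delta=0.005, seed=0):
        self.N, self.m, self.H, self.K, self.delta = N, m, H, K, delta
        self.rng = random.Random(seed)
        # "s_bot" plays the role of the absorbing state (our assumption); s1 is the initial state.
        self.states = ["s_bot"] + [f"s{i}" for i in range(1, 6)]
        self.base_actions = [f"a{i}" for i in range(1, N + 1)]          # A_univ = {a1, ..., aN}
        # Super-actions: all m-cardinality subsets of A_univ, so |A| = C(N, m).
        self.actions = list(combinations(self.base_actions, m))

    def reward(self, s, super_action):
        # r(s, a) = 1 for every (state, super-action) pair.
        return 1.0

    def trigger_prob(self, s, base_action):
        # q(s, a) = 1/m for the last two base actions, 1/(2m) for all others.
        if base_action in (f"a{self.N - 1}", f"a{self.N}"):
            return 1.0 / self.m
        return 1.0 / (2 * self.m)

    def transition(self, s, base_action):
        # Transitions are action-independent: from s1 go to {s2, s3} w.p. 0.5 each,
        # and move between {s2, s3} and {s4, s5} w.p. 0.5 each; otherwise absorb.
        if s == "s1":
            return self.rng.choice(["s2", "s3"])
        if s in ("s2", "s3"):
            return self.rng.choice(["s4", "s5"])
        if s in ("s4", "s5"):
            return self.rng.choice(["s2", "s3"])
        return "s_bot"


env = BranchingInstance(N=10, m=2)
print(len(env.actions))   # 45 for N = 10; BranchingInstance(N=15, m=2) gives 105
```

A regret experiment would then run K = 5000 episodes of horizon H = 6 from s1, expanding each branch only when its trigger Bernoulli fires; the BranchVI and BranchRFE learners from the paper's pseudocode are not reproduced in this sketch.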