Branching Reinforcement Learning
Authors: Yihan Du, Wei Chen
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments for branching RL. We set K = 5000, δ = 0.005, H = 6, m = 2, N ∈ {10, 15}, S = {s_⊥, s_1, ..., s_5}. A is the collection of all m-cardinality subsets of A_univ = {a_1, ..., a_N}, and thus |A| = C(N, m) ∈ {45, 105}. The reward function r(s, a) = 1 for any (s, a) ∈ S × A. The trigger probability q(s, a) = 1/m for any (s, a) ∈ S × {a_{N-1}, a_N}, and q(s, a) = 1/(2m) for any (s, a) ∈ S × (A_univ \ {a_{N-1}, a_N}). We set s_1 as the initial state for each episode. Under all actions a ∈ A_univ, the transition probability q(s′ | s_1, a) = 0.5 for any s′ ∈ {s_2, s_3}, and q(s′ | s, a) = 0.5 for any (s, s′) ∈ {s_2, s_3} × {s_4, s_5} or (s, s′) ∈ {s_4, s_5} × {s_2, s_3}. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs. |
| Researcher Affiliation | Collaboration | IIIS, Tsinghua University, Beijing, China; Microsoft Research. Correspondence to: Yihan Du <duyh18@mails.tsinghua.edu.cn>, Wei Chen <weic@microsoft.com>. |
| Pseudocode | Yes | Algorithm 1: Branch VI; Algorithm 2: Branch RFE |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper describes a constructed problem instance with specific parameters (H, m, N, S, reward function, trigger probability, transition probability) for its experiments, but it does not use a publicly available or open dataset. No concrete access information for a dataset is provided. |
| Dataset Splits | No | The paper defines the parameters of its constructed problem instance for the experiments, but it does not specify training, test, or validation dataset splits. The problem is a simulation within a defined environment, not a split of an existing dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers for reproducing the experiments. |
| Experiment Setup | Yes | We set K = 5000, δ = 0.005, H = 6, m = 2, N ∈ {10, 15}, S = {s_⊥, s_1, ..., s_5}. A is the collection of all m-cardinality subsets of A_univ = {a_1, ..., a_N}, and thus |A| = C(N, m) ∈ {45, 105}. The reward function r(s, a) = 1 for any (s, a) ∈ S × A. The trigger probability q(s, a) = 1/m for any (s, a) ∈ S × {a_{N-1}, a_N}, and q(s, a) = 1/(2m) for any (s, a) ∈ S × (A_univ \ {a_{N-1}, a_N}). We set s_1 as the initial state for each episode. Under all actions a ∈ A_univ, the transition probability q(s′ | s_1, a) = 0.5 for any s′ ∈ {s_2, s_3}, and q(s′ | s, a) = 0.5 for any (s, s′) ∈ {s_2, s_3} × {s_4, s_5} or (s, s′) ∈ {s_4, s_5} × {s_2, s_3}. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs. (A construction sketch of this instance follows the table.) |
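Although no code is released, the quoted setup is concrete enough to reconstruct the problem instance. Below is a minimal Python sketch of that construction. The class and method names (`BranchingInstance`, `trigger_prob`, `next_state`) are my own, and the absorbing state `s_perp` for untriggered branches is an assumption; this is not the authors' implementation and it does not include the Branch VI or Branch RFE algorithms themselves.

```python
import itertools

import numpy as np


class BranchingInstance:
    """Sketch of the constructed branching-RL instance quoted above (names are my own)."""

    def __init__(self, N=10, m=2, H=6, seed=0):
        self.rng = np.random.default_rng(seed)
        self.N, self.m, self.H = N, m, H
        # s_perp stands in for an absorbing state (assumption); s1 is the initial state.
        self.states = ["s_perp", "s1", "s2", "s3", "s4", "s5"]
        self.base_actions = [f"a{i}" for i in range(1, N + 1)]
        # Super-actions: all m-cardinality subsets of the base-action set,
        # so |A| = C(N, m), i.e. 45 for N = 10 and 105 for N = 15.
        self.actions = list(itertools.combinations(self.base_actions, m))
        self.init_state = "s1"

    def reward(self, state, base_action):
        # r(s, a) = 1 for every (state, action) pair.
        return 1.0

    def trigger_prob(self, state, base_action):
        # q(s, a) = 1/m for the two highest-indexed base actions, 1/(2m) otherwise.
        high = {f"a{self.N - 1}", f"a{self.N}"}
        return 1.0 / self.m if base_action in high else 1.0 / (2 * self.m)

    def next_state(self, state, base_action):
        # Transitions are the same under every base action:
        # s1 -> {s2, s3} uniformly; {s2, s3} <-> {s4, s5} uniformly; s_perp absorbs.
        if state == "s1":
            return self.rng.choice(["s2", "s3"])
        if state in ("s2", "s3"):
            return self.rng.choice(["s4", "s5"])
        if state in ("s4", "s5"):
            return self.rng.choice(["s2", "s3"])
        return "s_perp"


if __name__ == "__main__":
    for N in (10, 15):
        env = BranchingInstance(N=N)
        print(N, len(env.actions))  # 45 for N = 10, 105 for N = 15
```

Running the sanity check at the bottom reproduces the quoted super-action counts |A| = 45 and 105, which is a quick way to confirm the action set is built as described.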