Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning
Authors: Baiyuan Chen, Shinji Ito, Masaaki Imaizumi
NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 28,288 | Output Tokens: 4,416 (including reasoning/thinking tokens)
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate transformers and other algorithms in a linear bandit setting. The stochastic linear bandit framework is given by $M = (w^*, \mathcal{E}, \mathcal{A}_1, \dots, \mathcal{A}_T)$. At each round $t \in [T]$, the learner selects an action $a_t \in \mathbb{R}^d$ from the set $\mathcal{A}_t = \{a_{t,1}, \dots, a_{t,A}\}$, which may vary over time. The learner then receives a reward $r_t = \langle a_t, w^* \rangle + \varepsilon_t$, where $\varepsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{E}$ and $w^* \in \mathbb{R}^d$ is unknown. The problem generalizes by setting $s_t = \mathcal{A}_t$, with the state transitioning deterministically to $s_{t+1}$ regardless of the action. We compare transformers against Linear UCB (LinUCB) and Thompson Sampling (TS), as well as MASTER (Wei and Luo, 2021) combined with LinUCB/TS (denoted as expert algorithms) under environments with varying degrees of non-stationarity. In our experiments, we set $d = 32$, $A = 10$, $\varepsilon_t \sim \mathcal{N}(0, 1.5^2)$, and $w^* \sim \mathrm{Unif}([0, 1]^d)$. We consider two types of environments: (1) Low Non-Stationarity: Models are evaluated over 1,000 rounds, with elevated rewards in $t \in [50, 100] \cup [350, 400]$ scaled to $r_t \in [3, 4]$, and the remaining rewards in $r_t \in [0, 1]$. Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t)\cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment. For transformer models, we use GPT-2 with $L = 16$ layers and $M = 16$ attention heads, trained for 200 epochs. The objective is to assess generalization to non-stationary environments unseen during training. From Figure 3, we observe that transformers achieve performance comparable to, and sometimes surpassing, both the expert algorithms and their MASTER variants, attaining near-optimal cumulative regret. Additional experimental results can be found in Appendix B. |
| Researcher Affiliation | Academia | Baiyuan Chen (The University of Tokyo); Shinji Ito (The University of Tokyo / RIKEN AIP); Masaaki Imaizumi (The University of Tokyo / RIKEN AIP) |
| Pseudocode | Yes | Algorithm 1: MALG (Multi-scale ALG) (Wei and Luo, 2021). Input: $n$, $\rho(\cdot)$. 1: for $\tau = 0, \dots, 2^n - 1$ do; 2: for $m = n, n-1, \dots, 0$ do; 3: if $\tau$ is a multiple of $2^m$ then; 4: with probability $\rho(2^n)/\rho(2^m)$, schedule a new instance alg of ALG at scale $2^m$; 5: run the active instance alg to output $\tilde{r}_\tau$, select an action, and update with feedback. Algorithm 2: MALG with Stationarity TEsts and Restarts (MASTER) (Wei and Luo, 2021). Input: $\hat{\rho}(\cdot)$ where $\hat{\rho}(t) = 6(\log_2 T + 1)\log(T)\rho(t)$ ($T$: block length). 1: initialize $t \leftarrow 1$; 2: for $n = 0, 1, \dots$ do; 3: set $t_n \leftarrow t$ and initialize an MALG (Algorithm 2) for the block $[t_n, t_n + 2^n - 1]$; 4: while $t < t_n + 2^n$ do; 5: run MALG to obtain prediction $\tilde{r}_t$, select action $a_t$, and receive reward $R_t$; 6: update MALG with feedback, and set $U_t = \min_{\tau \in [t_n, t]} \tilde{r}_\tau$; 7: perform Test 1 and Test 2 (see below); 8: increment $t \leftarrow t + 1$; 9: if either test returns fail then; 10: restart from Line 2. 11: Test 1: if $t = \text{alg}.e$ for some order-$m$ alg and $\frac{1}{2^m}\sum_{\tau = \text{alg}.s}^{\text{alg}.e} R_\tau \ge U_t + 9\hat{\rho}(2^m)$, return fail. 12: Test 2: if $\frac{1}{t - t_n + 1}\sum_{\tau = t_n}^{t}(\tilde{r}_\tau - r_\tau) \ge 3\hat{\rho}(t - t_n + 1)$, return fail. |
| Open Source Code | No | Justification: We clearly stated the source of data. Also, we will open the source code. |
| Open Datasets | No | In this section, we evaluate transformers and other algorithms in a linear bandit setting. The stochastic linear bandit framework is given by $M = (w^*, \mathcal{E}, \mathcal{A}_1, \dots, \mathcal{A}_T)$. At each round $t \in [T]$, the learner selects an action $a_t \in \mathbb{R}^d$ from the set $\mathcal{A}_t = \{a_{t,1}, \dots, a_{t,A}\}$, which may vary over time. The learner then receives a reward $r_t = \langle a_t, w^* \rangle + \varepsilon_t$, where $\varepsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{E}$ and $w^* \in \mathbb{R}^d$ is unknown. The problem generalizes by setting $s_t = \mathcal{A}_t$, with the state transitioning deterministically to $s_{t+1}$ regardless of the action. We compare transformers against Linear UCB (LinUCB) and Thompson Sampling (TS), as well as MASTER (Wei and Luo, 2021) combined with LinUCB/TS (denoted as expert algorithms) under environments with varying degrees of non-stationarity. In our experiments, we set $d = 32$, $A = 10$, $\varepsilon_t \sim \mathcal{N}(0, 1.5^2)$, and $w^* \sim \mathrm{Unif}([0, 1]^d)$. We consider two types of environments: (1) Low Non-Stationarity: Models are evaluated over 1,000 rounds, with elevated rewards in $t \in [50, 100] \cup [350, 400]$ scaled to $r_t \in [3, 4]$, and the remaining rewards in $r_t \in [0, 1]$. Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t)\cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment. |
| Dataset Splits | Yes | Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t)\cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment. |
| Hardware Specification | No | Justification: Our experiments are small-scale and implementable by a small laptop. Also, we do not pursue the computational cost in this study, so the computational resource is out of our focus. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers used for the experiments. It mentions GPT-2 as a model but not its software implementation details. |
| Experiment Setup | Yes | In our experiments, we set $d = 32$, $A = 10$, $\varepsilon_t \sim \mathcal{N}(0, 1.5^2)$, and $w^* \sim \mathrm{Unif}([0, 1]^d)$. We consider two types of environments: (1) Low Non-Stationarity: Models are evaluated over 1,000 rounds, with elevated rewards in $t \in [50, 100] \cup [350, 400]$ scaled to $r_t \in [3, 4]$, and the remaining rewards in $r_t \in [0, 1]$. Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t)\cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment. For transformer models, we use GPT-2 with $L = 16$ layers and $M = 16$ attention heads, trained for 200 epochs. The objective is to assess generalization to non-stationary environments unseen during training. ... In our experiments, we set the confidence scaling parameter $\alpha$ of LinUCB to 1, and the noise variance for Thompson Sampling (TS) to 0.3. |
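
To make the quoted setup concrete, the following is a minimal sketch of the high non-stationarity environment with a plain LinUCB baseline at α = 1. It is not the authors' code, which had not been released when this report was generated. The values d = 32, A = 10, noise standard deviation 1.5, the reward formula, and the evaluation settings b ∈ {0.018, 0.025} with 200 rounds come from the excerpt above. Everything else is an assumption made for illustration: the standard normal distribution of the action sets, the ridge regularizer of 1, the seed, and the names `reward`, `LinUCB`, and `run`. The transformer and the MASTER wrapper are omitted.

```python
import math
import random

D, NUM_ACTIONS, NOISE_STD = 32, 10, 1.5  # d = 32, A = 10, eps_t ~ N(0, 1.5^2)


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def reward(action, w, b, t, eps=0.0):
    """r_t = (<a_t, w*> + eps_t) * cos(2*pi*b*t); eps=0 gives the mean reward."""
    return (dot(action, w) + eps) * math.cos(2 * math.pi * b * t)


class LinUCB:
    """Stationary LinUCB baseline with confidence scaling alpha."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # Inverse of the regularized Gram matrix; the regularizer 1 is an assumption.
        self.gram_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.target = [0.0] * dim  # running sum of reward * action

    def select(self, actions):
        theta = [dot(row, self.target) for row in self.gram_inv]  # ridge estimate of w*

        def ucb(a):
            width = math.sqrt(max(dot(a, [dot(row, a) for row in self.gram_inv]), 0.0))
            return dot(theta, a) + self.alpha * width

        return max(range(len(actions)), key=lambda i: ucb(actions[i]))

    def update(self, action, observed):
        # Sherman-Morrison rank-one update of the inverse Gram matrix.
        v = [dot(row, action) for row in self.gram_inv]
        denom = 1.0 + dot(action, v)
        for i, vi in enumerate(v):
            for j, vj in enumerate(v):
                self.gram_inv[i][j] -= vi * vj / denom
        self.target = [s + observed * a for s, a in zip(self.target, action)]


def run(b, rounds=200, seed=0):
    """Cumulative regret against the best action of each round."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(D)]  # w* ~ Unif([0, 1]^d)
    agent = LinUCB(D, alpha=1.0)
    regret = 0.0
    for t in range(1, rounds + 1):
        # The action-set distribution is not stated in the excerpt; standard normal is assumed.
        actions = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(NUM_ACTIONS)]
        means = [reward(a, w, b, t) for a in actions]
        choice = agent.select(actions)
        noisy = reward(actions[choice], w, b, t, eps=rng.gauss(0.0, NOISE_STD))
        agent.update(actions[choice], noisy)
        regret += max(means) - means[choice]
    return regret


if __name__ == "__main__":
    for b in (0.018, 0.025):
        print(f"b={b}: cumulative regret of LinUCB over 200 rounds = {run(b):.2f}")
```

Because the cosine factor flips the sign of the reward over time, a stationary estimator such as this one has no mechanism for tracking the drift, which is the gap that the restart tests in MASTER and the trained transformer are meant to address in the paper.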