Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning

Authors: Baiyuan Chen, Shinji Ito, Masaaki Imaizumi

NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 28,288 (total number of tokens sent to the LLM as input for this paper's analysis) | Output Tokens: 4,416 (total number of tokens produced by the LLM, including reasoning/thinking tokens, for this paper's analysis)

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we evaluate transformers and other algorithms in a linear bandit setting. The stochastic linear bandit framework is given by $M = (w^*, \mathcal{E}, A_1, \dots, A_T)$. At each round $t \in [T]$, the learner selects an action $a_t \in \mathbb{R}^d$ from the set $A_t = \{a_{t,1}, \dots, a_{t,A}\}$, which may vary over time. The learner then receives a reward $r_t = \langle a_t, w^* \rangle + \varepsilon_t$, where $\varepsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{E}$ and $w^* \in \mathbb{R}^d$ is unknown. The problem generalizes by setting $s_t = A_t$, with the state transitioning deterministically to $s_{t+1}$ regardless of the action. We compare transformers against Linear UCB (LinUCB) and Thompson Sampling (TS), as well as MASTER (Wei and Luo, 2021) combined with LinUCB/TS (denoted as expert algorithms) under environments with varying degrees of non-stationarity. In our experiments, we set $d = 32$, $A = 10$, $\varepsilon_t \sim N(0, 1.5^2)$, and $w^* \sim \mathrm{Unif}([0, 1]^d)$. We consider two types of environments: (1) Low Non-Stationarity: Models are evaluated over 1,000 rounds, with elevated rewards in $t \in [50, 100] \cup [350, 400]$ scaled to $r_t \in [3, 4]$, and the remaining rewards in $r_t \in [0, 1]$. Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t) \cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment. For transformer models, we use GPT-2 with $L = 16$ layers and $M = 16$ attention heads, trained for 200 epochs. The objective is to assess generalization to non-stationary environments unseen during training. From Figure 3, we observe that transformers achieve performance comparable to, and sometimes surpassing, both the expert algorithms and their MASTER variants, attaining near-optimal cumulative regret. Additional experimental results can be found in Appendix B.
Researcher Affiliation	Academia	Baiyuan Chen The University of Tokyo EMAIL Shinji Ito The University of Tokyo / RIKEN AIP EMAIL Masaaki Imaizumi The University of Tokyo / RIKEN AIP EMAIL
Pseudocode	Yes	Algorithm 1: MALG (Multi-scale ALG) (Wei and Luo, 2021). Input: $n$, $\rho(\cdot)$. 1: for $\tau = 0, \dots, 2^n - 1$ do 2: for $m = n, n-1, \dots, 0$ do 3: if $\tau$ is a multiple of $2^m$ then 4: with probability $\rho(2^n)/\rho(2^m)$, schedule a new instance alg of ALG at scale $2^m$; 5: run the active instance alg to output $\tilde{r}_\tau$, select an action, and update with feedback. Algorithm 2: MALG with Stationarity TEsts and Restarts (MASTER) (Wei and Luo, 2021). Input: $\hat{\rho}(\cdot)$, where $\hat{\rho}(t) = 6(\log_2 T + 1)\log(T)\rho(t)$ ($T$: block length). 1: initialize $t \leftarrow 1$; 2: for $n = 0, 1, \dots$ do 3: set $t_n \leftarrow t$ and initialize an MALG (Algorithm 1) for the block $[t_n, t_n + 2^n - 1]$; 4: while $t < t_n + 2^n$ do 5: run MALG to obtain prediction $\tilde{r}_t$, select action $a_t$, and receive reward $R_t$; 6: update MALG with feedback, and set $U_t = \min_{\tau \in [t_n, t]} \tilde{r}_\tau$; 7: perform Test 1 and Test 2 (see below); 8: increment $t \leftarrow t + 1$; 9: if either test returns fail then 10: restart from Line 2. 11: Test 1: if $t = \text{alg}.e$ for some order-$m$ alg and $\frac{1}{2^m}\sum_{\tau=\text{alg}.s}^{\text{alg}.e} R_\tau \ge U_t + 9\hat{\rho}(2^m)$, return fail. 12: Test 2: if $\frac{1}{t - t_n + 1}\sum_{\tau=t_n}^{t}(\tilde{r}_\tau - r_\tau) \ge 3\hat{\rho}(t - t_n + 1)$, return fail.
Open Source Code	No	Justification: We clearly stated the source of data. Also, we will open the source code.
Open Datasets	No	In this section, we evaluate transformers and other algorithms in a linear bandit setting. The stochastic linear bandit framework is given by $M = (w^*, \mathcal{E}, A_1, \dots, A_T)$. At each round $t \in [T]$, the learner selects an action $a_t \in \mathbb{R}^d$ from the set $A_t = \{a_{t,1}, \dots, a_{t,A}\}$, which may vary over time. The learner then receives a reward $r_t = \langle a_t, w^* \rangle + \varepsilon_t$, where $\varepsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{E}$ and $w^* \in \mathbb{R}^d$ is unknown. The problem generalizes by setting $s_t = A_t$, with the state transitioning deterministically to $s_{t+1}$ regardless of the action. We compare transformers against Linear UCB (LinUCB) and Thompson Sampling (TS), as well as MASTER (Wei and Luo, 2021) combined with LinUCB/TS (denoted as expert algorithms) under environments with varying degrees of non-stationarity. In our experiments, we set $d = 32$, $A = 10$, $\varepsilon_t \sim N(0, 1.5^2)$, and $w^* \sim \mathrm{Unif}([0, 1]^d)$. We consider two types of environments: (1) Low Non-Stationarity: Models are evaluated over 1,000 rounds, with elevated rewards in $t \in [50, 100] \cup [350, 400]$ scaled to $r_t \in [3, 4]$, and the remaining rewards in $r_t \in [0, 1]$. Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t) \cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment.
Dataset Splits	Yes	Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t) \cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment.
Hardware Specification	No	Justification: Our experiments are small-scale and implementable by a small laptop. Also, we do not pursue the computational cost in this study, so the computational resource is out of our focus.
Software Dependencies	No	The paper does not explicitly mention specific software dependencies with version numbers used for the experiments. It mentions GPT-2 as a model but not its software implementation details.
Experiment Setup	Yes	In our experiments, we set $d = 32$, $A = 10$, $\varepsilon_t \sim N(0, 1.5^2)$, and $w^* \sim \mathrm{Unif}([0, 1]^d)$. We consider two types of environments: (1) Low Non-Stationarity: Models are evaluated over 1,000 rounds, with elevated rewards in $t \in [50, 100] \cup [350, 400]$ scaled to $r_t \in [3, 4]$, and the remaining rewards in $r_t \in [0, 1]$. Training data consists of 100,000 samples with normalized rewards $r_t \in [0, 1]$. (2) High Non-Stationarity: The reward is defined as $r_t = (\langle a_t, w^* \rangle + \varepsilon_t) \cos(2\pi b t)$. For training, we generate 100,000 samples for each $b \in \{0.005, 0.01, 0.015, 0.02\}$. For evaluation, we test on unseen environments with $b \in \{0.018, 0.025\}$, running 200 rounds per environment. For transformer models, we use GPT-2 with $L = 16$ layers and $M = 16$ attention heads, trained for 200 epochs. The objective is to assess generalization to non-stationary environments unseen during training. ... In our experiments, we set the confidence scaling parameter $\alpha$ of LinUCB to 1, and the noise variance for Thompson Sampling (TS) to 0.3.
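
The "High Non-Stationarity" environment quoted in the table can be illustrated with a minimal Python sketch. It is not the authors' code (the Open Source Code row records that none was released). It uses the quoted values of d, A, the noise standard deviation, and the cosine-modulated reward. Everything else is an assumption made for illustration: the function names `make_env` and `rollout`, the standard-normal action features (the quoted text does not say how actions are drawn), and the uniformly random policy, which stands in for a real learner such as LinUCB or a transformer.

```python
import math
import random

# Values quoted in the Experiment Setup row: d = 32, A = 10, noise std 1.5.
D, NUM_ACTIONS, NOISE_STD = 32, 10, 1.5


def make_env(b, seed=0):
    """Build one environment with modulation frequency b (hypothetical helper)."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(D)]  # w* ~ Unif([0, 1]^d)

    def sample_actions():
        # ASSUMPTION: i.i.d. standard-normal action features; not stated in the source.
        return [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(NUM_ACTIONS)]

    def reward(action, t):
        # r_t = (<a_t, w*> + eps_t) * cos(2 * pi * b * t), with eps_t ~ N(0, 1.5^2)
        mean = sum(a * wi for a, wi in zip(action, w))
        noise = rng.gauss(0.0, NOISE_STD)
        return (mean + noise) * math.cos(2 * math.pi * b * t)

    return sample_actions, reward


def rollout(b, rounds=200, seed=0):
    """Play a uniformly random policy (placeholder learner) and return the rewards."""
    sample_actions, reward = make_env(b, seed)
    picker = random.Random(seed + 1)
    return [reward(picker.choice(sample_actions()), t) for t in range(1, rounds + 1)]
```

Calling `rollout(0.018)` or `rollout(0.025)` mirrors the two held-out evaluation frequencies and the 200-round horizon from the quoted setup. Swapping the random policy for an actual bandit algorithm would be the next step toward reproducing the comparison.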