Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Random Policy Enables In-Context Reinforcement Learning within Trust Horizons

Authors: Weiqin Chen, Santiago Paternain

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation."
Researcher Affiliation | Academia | Weiqin Chen and Santiago Paternain, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute
Pseudocode | Yes | Algorithm 1: Collecting Contexts under Random Policy; Algorithm 2: Collecting Query States and Action Labels under Random Policy (MAB); Algorithm 3: Collecting Query States and Action Labels under Random Policy (MDP); Algorithm 4: State-Action Distillation (SAD) under Random Policy; Algorithm 5: Collecting Query States and Action Labels under Random Policy (Sparse-Reward MDP); Algorithm 6: Pretraining and Deployment of SAD (inspired by Lee et al., 2024)
Open Source Code | No | The paper neither states that source code for the methodology will be released nor links to a code repository.
Open Datasets | Yes | "In this section, we substantiate the efficacy of our proposed SAD method on five ICRL benchmark problems: Gaussian Bandits, Bernoulli Bandits, Darkroom, Darkroom-Large, Miniworld, which are commonly considered in the ICRL literature (Laskin et al., 2022; Lee et al., 2024; Dong et al., 2024)."
Dataset Splits | Yes | "Given 7 x 7 = 49 available goals, we utilize 39 of these goals (~80%) for pretraining and reserve the remaining 10 (~20%) (unseen during pretraining) for test. ... We still consider 80% of the 100 available goals for pretraining and the remaining unseen 20% goals for test." Table 2 also lists a training/test ratio of 0.8/0.2 for each environment.
Hardware Specification | No | The paper does not report hardware details such as GPU models, CPU types, or memory capacity used for the experiments; it specifies only the transformer architecture (a causal GPT-2 model) and its hyperparameters.
Software Dependencies | No | The paper mentions using a "causal GPT2 model" for the transformer architecture, but it names no software libraries or frameworks (e.g., PyTorch, TensorFlow) with version numbers, which would be necessary for replication.
Experiment Setup | Yes | Table 1 (main hyperparameters of each algorithm; identical for AD, DPT, DIT, and SAD (ours)): 3 attention heads, 3 attention layers, embedding size 32, learning rate 0.001, dropout 0.1. Table 2 (main hyperparameters of each environment), e.g., Gaussian Bandits: trust horizon 320, 100 epochs, context horizon 500; Miniworld: trust horizon 3, 200 epochs, context horizon 50.
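The Darkroom goal split quoted in the Dataset Splits row (39 of 49 goals for pretraining, 10 held out for test) is straightforward to reproduce. A minimal sketch follows; the shuffling scheme, seed, and function name are assumptions for illustration, not the authors' code:

```python
import random

def split_goals(n_goals: int, train_ratio: float = 0.8, seed: int = 0):
    """Split goal indices into pretraining and test sets.

    Illustrative only: the paper reports 39/10 of 49 Darkroom goals
    (~80%/20%); the shuffle and seed here are assumptions.
    """
    goals = list(range(n_goals))
    rng = random.Random(seed)
    rng.shuffle(goals)  # randomize which goals are held out
    n_train = round(n_goals * train_ratio)
    return goals[:n_train], goals[n_train:]

# 7 x 7 = 49 Darkroom goals -> 39 pretraining, 10 unseen test goals
train_goals, test_goals = split_goals(7 * 7)
print(len(train_goals), len(test_goals))  # → 39 10
```

The same helper with `n_goals=100` reproduces the 80/20 split described for the larger environment.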
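For readers reconstructing Table 1, the transformer settings shared by all four algorithms can be captured in a small config object. This is a hedged sketch: the class and field names are assumptions, only the numeric values come from the table:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TransformerConfig:
    """Shared hyperparameters from Table 1 (same for AD, DPT, DIT, SAD)."""
    n_heads: int = 3        # number of attention heads
    n_layers: int = 3       # number of attention layers
    embed_size: int = 32    # embedding size
    learning_rate: float = 1e-3
    dropout: float = 0.1

# Table 1 reports identical values across all four algorithms.
configs = {name: TransformerConfig() for name in ("AD", "DPT", "DIT", "SAD")}
print(asdict(configs["SAD"]))
```

Per-environment values (trust horizon, epochs, context horizon) vary per Table 2 and would be held in a separate, environment-keyed config.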