Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Random Policy Enables In-Context Reinforcement Learning within Trust Horizons
Authors: Weiqin Chen, Santiago Paternain
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation. |
| Researcher Affiliation | Academia | Weiqin Chen (EMAIL), Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute; Santiago Paternain (EMAIL), Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute |
| Pseudocode | Yes | Algorithm 1: Collecting Contexts under Random Policy; Algorithm 2: Collecting Query States and Action Labels under Random Policy (MAB); Algorithm 3: Collecting Query States and Action Labels under Random Policy (MDP); Algorithm 4: State-Action Distillation (SAD) under Random Policy; Algorithm 5: Collecting Query States and Action Labels under Random Policy (Sparse-Reward MDP); Algorithm 6: Pretraining and Deployment of SAD (Inspired by (Lee et al., 2024)) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | In this section, we substantiate the efficacy of our proposed SAD method on five ICRL benchmark problems: Gaussian Bandits, Bernoulli Bandits, Darkroom, Darkroom-Large, Miniworld, which are commonly considered in the ICRL literature (Laskin et al., 2022; Lee et al., 2024; Dong et al., 2024). |
| Dataset Splits | Yes | Given 7 x 7 = 49 available goals, we utilize 39 of these goals (~80%) for pretraining and reserve the remaining 10 (~20%) (unseen during pretraining) for test. ... We still consider 80% of the 100 available goals for pretraining and the remaining unseen 20% goals for test. ... Table 2: The main hyperparameters of each environment ... Training/Test ratio 0.8/0.2 |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions the transformer architecture (causal GPT2 model) and hyperparameters. |
| Software Dependencies | No | The paper mentions using a "causal GPT2 model" for the transformer architecture, but it does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) with their version numbers that would be necessary for replication. |
| Experiment Setup | Yes | Table 1: The main hyperparameters of each algorithm (AD, DPT, DIT, SAD (ours); identical across all four): Number of attention heads: 3; Number of attention layers: 3; Embedding size: 32; Learning rate: 0.001; Dropout: 0.1. Table 2: The main hyperparameters of each environment (Gaussian Bandits ... Miniworld): Trust Horizon: 320 ... 3; # of epochs: 100 ... 200; Context horizon: 500 ... 50 |
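The Dataset Splits row describes an ~80/20 partition of the 49 Darkroom goals (39 for pretraining, 10 held out for test). A minimal sketch of such a split, assuming a uniform random shuffle; the function name, seed, and goal encoding are illustrative and not taken from the paper:

```python
import random

def split_goals(grid_size=7, train_ratio=0.8, seed=0):
    """Partition the grid_size x grid_size goal cells into
    pretraining and held-out test goals (sketch, not the paper's code)."""
    goals = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    rng = random.Random(seed)
    rng.shuffle(goals)
    n_train = round(train_ratio * len(goals))  # 39 of 49 goals at 0.8
    return goals[:n_train], goals[n_train:]

train_goals, test_goals = split_goals()
print(len(train_goals), len(test_goals))  # 39 10
```

Holding out entire goals (rather than trajectories) ensures the test goals are genuinely unseen during pretraining, matching the paper's stated protocol.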