Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Towards Provable Emergence of In-Context Reinforcement Learning
Authors: Jiuqi Wang, Rohan Chandra, Shangtong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical studies to verify our theoretical results. In particular, mirroring our theoretical setup, we investigate the following two questions: 1. Does the pretraining yield a Transformer that can perform in-context policy evaluation? 2. Do the converged parameters align with θTD? We answer both questions for both multi-task TD and multi-task MC. |
| Researcher Affiliation | Academia | Jiuqi Wang Department of Computer Science University of Virginia Charlottesville, VA 22903 EMAIL Rohan Chandra Department of Computer Science University of Virginia Charlottesville, VA 22903 EMAIL Shangtong Zhang Department of Computer Science University of Virginia Charlottesville, VA 22903 EMAIL |
| Pseudocode | Yes | A Multi-task TD and MC We provide the pseudocode of multi-task TD and MC in this section. Algorithm 1: Multi-Task Temporal Difference Learning (adapted from Algorithm 1 of Wang et al. [2025]) Algorithm 2: Multi-Task Monte Carlo Learning Algorithm 3: Boyan Chain MRP Generation (Adapted from Algorithm 2 of Wang et al. [2025]) Algorithm 4: Loop MRP Generation |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We still need to clean up the codebase and create instructions to run our code. We will submit our code as part of the supplementary material. |
| Open Datasets | No | We employ Boyan’s chain [Boyan, 1999] as our environment to construct the MRPs. Boyan’s chain allows us to have full information and control over the environment, including analytically solving for the true values and the stationary distributions. Figure 3 shows an example of an S-state Boyan’s chain. Adapting the technique of Wang et al. [2025], we randomly generate the reward and transition probability functions, preserving the topology of the chain to ensure its ergodicity. Our only simplifying modification is setting the stationary distribution as the initial distribution. The details of the task generation, including the distributions to sample p and r, can be found in Algorithm 3. |
| Dataset Splits | Yes | For each trial of the experiment, we generate 20,000 tasks for training with γ = 0.9. Within each task characterized by p, r, we generate a mini-batch of size b = 64. After each trial, we sample k = 10 novel tasks T1, ..., Tk from the task distribution as our validation set. |
| Hardware Specification | Yes | C.2 Compute Resources We run our experiments in parallel on a single node of a CPU cluster. The node has 150 CPU cores and 150 GB of memory. |
| Software Dependencies | No | We use NumPy [Harris et al., 2020] for data processing and implementing the MRPs. We use PyTorch [Ansel et al., 2024] to create and train our models. For data visualization, we use Matplotlib [Hunter, 2007] to create the plots. |
| Experiment Setup | Yes | Table 1: Hyperparameters and more training details. optimizer: Adam [Kingma and Ba, 2015]; learning rate: 0.001; weight decay: 0.0; batch size: 64; # of attention layers: 30; # of Boyan’s chain states: 5; discount factor: 0.9; # of Monte Carlo rollout steps: 200; # of random seeds: 20; # of Boyan’s chain tasks for training: 20,000; # of validation instances: 10; validation context lengths: 5, 10, ..., 100 |
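
The Open Datasets row notes that the environment allows the true values and the stationary distribution to be solved for analytically. The following is a minimal sketch of what that means for a small Markov reward process (MRP), using the state count and discount factor from Table 1. It is not the paper's code: `random_mrp` is a hypothetical sampler that draws a dense, strictly positive transition matrix rather than following the Boyan's chain topology of Algorithm 3, and the sampling ranges are assumptions chosen for illustration. The value function is obtained by iterating the Bellman equation v = r + γPv to its fixed point, and the stationary distribution by power iteration.

```python
import random

GAMMA = 0.9      # discount factor, from Table 1
NUM_STATES = 5   # number of chain states, from Table 1


def random_mrp(num_states, rng):
    """Hypothetical task sampler (NOT Algorithm 3 of the paper).

    Draws a dense transition matrix with strictly positive entries,
    which guarantees ergodicity, and a random reward per state.
    """
    P = []
    for _ in range(num_states):
        row = [rng.uniform(0.1, 1.0) for _ in range(num_states)]
        total = sum(row)
        P.append([x / total for x in row])  # normalize each row to sum to 1
    r = [rng.uniform(-1.0, 1.0) for _ in range(num_states)]
    return P, r


def true_values(P, r, gamma, iters=2000):
    """Iterate v <- r + gamma * P v, converging to the Bellman fixed point."""
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


def stationary_distribution(P, iters=2000):
    """Power iteration mu <- mu P, starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
    return mu


rng = random.Random(0)  # fixed seed so the sketch is repeatable
P, r = random_mrp(NUM_STATES, rng)
v = true_values(P, r, GAMMA)
mu = stationary_distribution(P)
```

With `mu` in hand, initial states can be drawn from the stationary distribution, which matches the simplifying modification described in the Open Datasets row. For larger state spaces a direct linear solve would usually be preferable to fixed-point iteration.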