Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Belief State Transformer

Authors: Edward Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Dinesh Jayaraman, Alex Lamb, John Langford

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. In Section 3 we then study in depth how the Belief State Transformer performs on a known-hard problem, the star graph. Figure 2: The Belief State Transformer outperforms baselines in all star graph navigation tasks. |
| Researcher Affiliation | Collaboration | 1Microsoft Research, 2University of Pennsylvania, 3UT Austin, 4University of Alberta |
| Pseudocode | Yes | Algorithm 1: Goal-conditioned Planning. Algorithm 2: Beam Search. |
| Open Source Code | Yes | Website: https://edwhu.github.io/bst-website. See Appendix E for code and scaling rules. Figure 12: A simple implementation of the belief state transformer objective. Figure 13: Efficient computation of all prefix-suffix losses. |
| Open Datasets | Yes | We use TinyStories (Eldan & Li, 2023), a dataset consisting of synthetic short stories. |
| Dataset Splits | Yes | We tokenize the dataset into a vocabulary of size 1000 and discard stories longer than 256 tokens, resulting in a dataset of 2.7 million stories. During evaluation, the models generate text using prefix-suffix snippets from an evaluation set of 100 unseen stories held out from the TinyStories dataset. For each story, we use the first 50 tokens as the prompt and the last 100 tokens as the suffix. |
| Hardware Specification | Yes | Each model is trained on a single A100 / H100 GPU with 80GB memory. |
| Software Dependencies | No | The paper does not explicitly state software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). The pseudocode includes `import torch` and `import torch.nn as nn`, implying PyTorch, but no version is given. |
| Experiment Setup | Yes | Both the forward and backward encoders consist of nlayers = 6 layers with an embedding dimension of 768, nhead = 8 attention heads, and an MLP expansion factor of 1. In all cases, we use the AdamW optimizer with a weight decay strength of 0.1. For G(2, 5), the learning rate is set to η = 3×10⁻⁴, while for G(5, 5) and G(2, 20), a smaller learning rate of η = 1×10⁻⁴ is used. We run all experiments for 100 epochs to ensure convergence. The Belief State Transformer's encoders have the following settings: nlayers = 8 blocks with embedding dimension edim = 768 and nheads = 8. The text heads Tn, Tp are implemented as a single 3-layer MLP with dimensionality 512 and ReLU activations that outputs two predictions. The total model size is 80 million parameters. We trained the Belief State Transformer on 1 epoch of the TinyStories dataset with 2.7M stories, with a batch size of 256. |
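The objective quoted above (a forward encoder over prefixes, a backward encoder over suffixes, and text heads that predict the next and previous tokens for every prefix-suffix pair, per the paper's Figures 12-13) can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the two Transformer encoders are replaced by precomputed random state vectors `F` and `B`, and the MLP text heads by a single linear map `W`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 16, 8, 6                  # toy vocab size, state dim, sequence length
tokens = rng.integers(0, V, size=T)

# Encoder stand-ins (hypothetical): F[i] plays the role of the forward state
# for the prefix tokens[:i]; B[j] the backward state for the suffix tokens[j:].
F = rng.normal(size=(T + 1, D))
B = rng.normal(size=(T + 1, D))

# A single linear "text head" mapping the concatenated belief state to
# next-token and previous-token logits (the paper uses a 3-layer MLP).
W = 0.1 * rng.normal(size=(2 * D, 2 * V))

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# All-pairs prefix/suffix loss: for every prefix ending before position i and
# every suffix starting at position j > i, predict tokens[i] (the prefix's
# next token) and tokens[j - 1] (the token preceding the suffix).
total, n_terms = 0.0, 0
for i in range(T):
    for j in range(i + 1, T + 1):
        h = np.concatenate([F[i], B[j]])
        logits = h @ W
        total -= log_softmax(logits[:V])[tokens[i]]      # next-token loss
        total -= log_softmax(logits[V:])[tokens[j - 1]]  # prev-token loss
        n_terms += 2

loss = total / n_terms
```

The double loop makes the O(T²) pairing explicit; the paper's Figure 13 computes the same set of losses efficiently in batched form.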