Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Pre-trained Large Language Models Learn to Predict Hidden Markov Models In-context

Authors: Yijia Dai, Zhaolin Gao, Yahya Sattar, Sarah Dean, Jennifer Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we present a comprehensive study on the ability of pre-trained LLMs to learn HMMs through in-context learning (Figure 1), revealing their surprisingly strong performance and offering actionable insights for real-world scientific experiments. A key finding is that pre-trained LLMs demonstrate a remarkable capacity to learn HMMs nearly optimally, achieving performance that approaches optimal Bayesian inference and often surpasses traditional statistical methods. These results not only advance our understanding of the emergent capabilities of in-context learning, but also introduce a novel and practical framework for using LLMs as powerful, efficient statistical tools in complex scientific data analysis. Our study makes three key contributions: 1. We conduct systematic, controlled experiments on synthetic HMMs and empirically show that pre-trained LLMs outperform traditional statistical methods such as Baum-Welch. Moreover, their prediction accuracy consistently converges to the theoretical optimum as given by the Viterbi algorithm with ground-truth model parameters across a wide range of HMM configurations (Section 2). 3. We translate our findings into practical guidelines for scientists, demonstrating how LLM in-context learning can serve as a diagnostic tool for assessing data complexity and uncovering underlying structure. When applied to real-world animal decision-making tasks, LLM ICL performs competitively with domain-specific models developed by human experts (Section 4).
Researcher Affiliation	Academia	Yijia Dai, Zhaolin Gao, Yahya Sattar, Sarah Dean, Jennifer J. Sun; Cornell University
Pseudocode	Yes	Algorithm 1: Viterbi Algorithm; Algorithm 2: Compute P(O_{t+1} | O_{t-k:t}); Algorithm 3: Baum-Welch Algorithm; Algorithm 4: n-gram Based Next-Observation Prediction; Algorithm 5: Trained Neural Networks for Single Sequence Prediction; Algorithm 6: Spectral Learning-Based Prediction
Open Source Code	Yes	Our code is available at https://github.com/DaiYijia02/icl-hmm.
Open Datasets	Yes	Decision-making Mice Dataset: This dataset, developed by the International Brain Laboratory (IBL) [25], has gained significant traction for studying mouse behavior within the neuroscience community. Reward-learning Rats Dataset: The dataset from Miller et al. [36] allows us to explore LLM ICL capabilities on more complex learning behaviors.
Dataset Splits	No	The paper mentions data generation parameters for synthetic HMMs ("For each parameter configuration, we sample 4,096 state-observation sequence pairs, each of length 2,048") and the number of sequences used for baselines ("these results are averaged over 16 samples (vs. 4,096 elsewhere)"), but it does not specify explicit training/test/validation splits for any of the datasets, either synthetic or real-world.
Hardware Specification	No	The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments. It mentions using pre-trained LLMs (Qwen and Llama family) and training LSTM and Transformer models, but not the computational resources for these operations.
Software Dependencies	No	The paper mentions 'pytorch' in the context of solving an optimization problem for constructing HMM parameters ("which we solve using first order methods with pytorch"), but it does not provide a specific version number for PyTorch or any other software dependencies crucial for reproducibility.
Experiment Setup	Yes	Experiment setting: Our experiment follows a three-step protocol: First, we specify the HMM parameters λ = (π, A, B) according to our control variables (described below). Second, we generate observation sequences {o_1, o_2, ...} from this parameterized model. Third, we evaluate the ability of candidate models to predict the next observation o_{t+1} given preceding observations o_{1:t}. We systematically vary five control parameters... For each parameter configuration, we sample 4,096 state-observation sequence pairs, each of length 2,048. We assess model performance across context lengths ranging from 4 to 2,048 observations, specifically {4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048}. For each HMM setting, we report performance metrics averaged over the 4,096 samples. Our candidate models are open-source pre-trained LLMs (Qwen and Llama family). For RNN LSTM: In our experiments, we set the number of observations as the vocab size, use a two-layer LSTM with an embedding dimension of 16 and a hidden dimension of 8, and train for 10 epochs using the Adam optimizer with a learning rate of 1e-3. The results are averaged over 16 sequences. For Transformer: In our experiments, we set the number of observations as the vocab size, use a two-layer Transformer with an embedding dimension of 16, a hidden dimension of 8, and 5 attention heads, and train for 10 epochs using the Adam optimizer with a learning rate of 1e-3. The results are averaged over 16 sequences.
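
The three-step protocol quoted in the Experiment Setup row (specify λ = (π, A, B), sample sequences, score next-observation prediction at several context lengths) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `sample_hmm`, `repeat_last`, and `accuracy_by_context` are hypothetical, the two-state parameters are made up, and `repeat_last` is a placeholder predictor standing in for an LLM or a baseline. How the paper indexes the final context length of 2,048 against sequences of length 2,048 is not stated in the excerpt, so this sketch simply skips any context length that has no following observation.

```python
import random

# Context lengths listed in the Experiment Setup row.
CONTEXT_LENGTHS = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def sample_hmm(pi, A, B, length, rng):
    """Sample (states, observations) of the given length from an HMM (pi, A, B)."""
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        # Emit an observation from the current state's emission row.
        observations.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        # Move to the next hidden state using the transition row.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, observations


def repeat_last(context):
    """Placeholder predictor (assumption): guess that the last observation repeats."""
    return context[-1]


def accuracy_by_context(sequences, predictor, context_lengths):
    """For each context length t, predict o_{t+1} from o_{1:t}; average over sequences."""
    results = {}
    for t in context_lengths:
        usable = [obs for obs in sequences if len(obs) > t]
        if not usable:
            continue  # no observation follows a context of this length
        hits = sum(predictor(obs[:t]) == obs[t] for obs in usable)
        results[t] = hits / len(usable)
    return results


# Illustrative two-state, two-symbol HMM (values are assumptions, not from the paper).
rng = random.Random(0)
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.2, 0.8]]
sequences = [sample_hmm(pi, A, B, 129, rng)[1] for _ in range(16)]
scores = accuracy_by_context(sequences, repeat_last, CONTEXT_LENGTHS)
```

Swapping `repeat_last` for any function that maps a context to a predicted symbol lets the same loop score other candidate models. The paper's own settings (4,096 sequences of length 2,048 per configuration) would replace the small numbers used here.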