Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Architectural and Inferential Inductive Biases for Exchangeable Sequence Modeling

Authors: Daksh Mittal, Leon Li, Thomson Yen, C. Guetta, Hongseok Namkoong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we empirically evaluate the impact of single-step and multi-step inference on uncertainty quantification, as well as on downstream optimization tasks such as multi-armed bandits and active learning. We find that multi-step inference significantly outperforms one-step inference, being up to 60% more efficient in bandit settings and requiring up to 10 times less data in active learning to achieve the same predictive performance. 3.2.1 Uncertainty quantification (UQ): We evaluate one-step and multi-step inference by generating datasets using Gaussian Processes (GP), a common choice in prior works [22, 23]. Specifically, we employ a GP with an RBF kernel: $f \sim \mathcal{GP}(m, K)$, where $K(X, X') = \sigma_f^2 \exp\left(-\lVert X - X' \rVert_2^2 / (2\ell^2)\right)$. Additionally, Gaussian noise $\mathcal{N}(0, \sigma^2)$ is added to the outputs. The input $X$ is drawn i.i.d. from $P_X$. To compare the performance of the two inference strategies, we use the multi-step log-loss metric. Further details on the metrics and experimental setup can be found in Section B. Figure 3(a) illustrates the comparison of multi-step log-loss performance between one-step and multi-step inference. Consistent with our theoretical results (Theorems 2 and 4), the results demonstrate that one-step inference performs worse than multi-step inference.
Researcher Affiliation	Academia	Daksh Mittal, Ang Li, Thomson Yen, Daniel Guetta, Hongseok Namkoong; Columbia University; EMAIL
Pseudocode	Yes	Algorithm 1: One-step and Multi-step inference (Thompson sampling) using sequence models (transformers) in the multi-armed bandits setting; Algorithm 2: Thompson Sampling for Multi-Armed Bandits (Gaussian-Gaussian setting); Algorithm 3: One-step and Multi-step inference (Uncertainty sampling) using sequence models (transformers) in the active learning setting
Open Source Code	Yes	Our code repository is available at: https://github.com/namkoong-lab/Inductive-biases-exchangeable-sequence.
Open Datasets	No	Data generating process: As previously mentioned, we generate data synthetically using Gaussian processes. Specifically, we employ a Gaussian Process (GP) with a Radial Basis Function (RBF) kernel: $f \sim \mathcal{GP}(m, K)$, where $m(X)$ represents the mean function, and $K(X, X') = \sigma_f^2 \exp\left(-\lVert X - X' \rVert_2^2 / (2\ell^2)\right)$ represents the covariance function. Additionally, Gaussian noise $\mathcal{N}(0, \sigma^2)$ is added to the outputs. The input $X$ is drawn i.i.d. from $P_X$. Unless stated otherwise, the parameters are set as follows: $m(X) = 0$, $X \sim U[-2.0, 2.0]$, $\sigma_f = 1.0$, $\ell = 1.0$, $\sigma = 0.1$.
Dataset Splits	No	For each experiment, we use 8192 test samples and conduct evaluations across five different random seeds. This process includes retraining the models on different training datasets and evaluating them on distinct test datasets.
Hardware Specification	Yes	Computational resources: We use NVIDIA A100-SXM4-80GB for training our models. For the standard causal architecture it takes 4 hr, while for the C-permutation invariant architecture it takes 17 hr to train the model.
Software Dependencies	No	For training the transformers, we use the Adam optimizer with default parameters, and the learning rate is adjusted using a cosine scheduler. The training parameters are as follows: warmup ratio is 0.03, minimum learning rate is 3.0e-5, learning rate is 0.0003, weight decay is 0.01, and batch size is 64. For all the experiments we train the transformer for 400 epochs.
Experiment Setup	Yes	Transformer architecture and training details: To compare the conditionally permutation-invariant architecture with the standard causal architecture, we use a decoder-only transformer with the following parameters. Both architectures share the same parameters, differing only in their masking schemes. The model parameters are as follows: Model dimension: 64; Feedforward dimension: 256; Number of attention heads: 4; Number of transformer layers: 4; Dropout: 0.1; Activation function: GELU. For embedding $(x, y)$, we use a neural network with two layers of sizes [256, 64]. Additionally, a final linear layer is used to predict the mean $\mu$ and standard deviation $\sigma$ of the output distribution, modeled as $Y \sim \mathcal{N}(\mu, \sigma^2)$. For training the transformers, we use the Adam optimizer with default parameters, and the learning rate is adjusted using a cosine scheduler. The training parameters are as follows: warmup ratio is 0.03, minimum learning rate is 3.0e-5, learning rate is 0.0003, weight decay is 0.01, and batch size is 64. For all the experiments we train the transformer for 400 epochs.
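
The training parameters quoted in the Experiment Setup row (warmup ratio 0.03, learning rate 0.0003, minimum learning rate 3.0e-5, cosine scheduler) can be illustrated with a small sketch. The quoted text does not say how the scheduler is implemented, so the shape used here (linear warmup followed by cosine decay to the minimum learning rate) is an assumption, and the function name `lr_at_step` and the `total_steps` argument are hypothetical. This is not taken from the authors' repository.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_ratio=0.03):
    """Learning rate at a 0-indexed optimizer step.

    Assumed schedule: linear warmup to peak_lr over the first
    warmup_ratio * total_steps steps, then cosine decay down to min_lr.
    Default values are the ones quoted in the Experiment Setup row.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine interpolation from peak_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for illustration
    for s in (0, 29, 30, 515, 1000):
        print(s, lr_at_step(s, total))
```

With 1000 total steps, the first 30 steps are warmup; the rate then starts at 0.0003 and decays toward 3.0e-5. The real number of optimizer steps depends on dataset size, batch size (64), and the 400 training epochs, none of which this sketch models.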