Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Architectural and Inferential Inductive Biases for Exchangeable Sequence Modeling

Authors: Daksh Mittal, Leon Li, Thomson Yen, C. Guetta, Hongseok Namkoong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we empirically evaluate the impact of single-step and multi-step inference on uncertainty quantification, as well as on downstream optimization tasks such as multi-armed bandits and active learning. We find that multi-step inference significantly outperforms one-step inference, being up to 60% more efficient in bandit settings and requiring up to 10 times less data in active learning to achieve the same predictive performance. 3.2.1 Uncertainty quantification (UQ) We evaluate one-step and multi-step inference by generating datasets using Gaussian Processes (GP), a common choice in prior works [22, 23]. Specifically, we employ a GP with an RBF kernel: $f \sim \mathcal{GP}(m, K)$, where $K(X, X') = \sigma_f^2 \exp\left(-\|X - X'\|_2^2 / 2\ell^2\right)$. Additionally, Gaussian noise $N(0, \sigma^2)$ is added to the outputs. The input $X$ is drawn i.i.d. from $P_X$. To compare the performance of the two inference strategies, we use the multi-step log-loss metric. Further details on the metrics and experimental setup can be found in Section B. Figure 3(a) illustrates the comparison of multi-step log-loss performance between one-step and multi-step inference. Consistent with our theoretical results (Theorems 2 and 4), the results demonstrate that one-step inference performs worse than multi-step inference.
Researcher Affiliation Academia Daksh Mittal, Ang Li, Thomson Yen, Daniel Guetta, Hongseok Namkoong Columbia University EMAIL
Pseudocode Yes Algorithm 1: One-step and Multi-step inference (Thompson sampling) using sequence models (transformers) in multi-armed bandits setting. Algorithm 2: Thompson Sampling for Multi-Armed Bandits (Gaussian-Gaussian setting). Algorithm 3: One-step and Multi-step inference (Uncertainty sampling) using sequence models (transformers) in active learning setting.
Open Source Code Yes Our code repository is available at: https://github.com/namkoong-lab/Inductive-biases-exchangeable-sequence.
Open Datasets No Data generating process: As previously mentioned, we generate data synthetically using Gaussian processes. Specifically, we employ a Gaussian Process (GP) with a Radial Basis Function (RBF) kernel: $f \sim \mathcal{GP}(m, K)$, where $m(X)$ represents the mean function, and $K(X, X') = \sigma_f^2 \exp\left(-\frac{\|X - X'\|_2^2}{2\ell^2}\right)$ represents the covariance function. Additionally, Gaussian noise $N(0, \sigma^2)$ is added to the outputs. The input $X$ is drawn i.i.d. from $P_X$. Unless stated otherwise, the parameters are set as follows: $m(X) = 0$, $X \sim U[-2.0, 2.0]$, $\sigma_f = 1.0$, $\ell = 1.0$, $\sigma = 0.1$.
Dataset Splits No For each experiment, we use 8192 test samples and conduct evaluations across five different random seeds. This process includes retraining the models on different training datasets and evaluating them on distinct test datasets.
Hardware Specification Yes Computational resources: We use NVIDIA A100-SXM4-80GB for training our models. For the standard-causal architecture it takes 4 hr, while for the C-permutation invariant architecture it takes 17 hr to train the model.
Software Dependencies No For training the transformers, we use the Adam optimizer with default parameters, and the learning rate is adjusted using a cosine scheduler. The training parameters are as follows: warmup ratio is 0.03, minimum learning rate is 3.0e-5, learning rate is 0.0003, weight decay is 0.01, and batch size is 64. For all the experiments we train the transformer for 400 epochs.
Experiment Setup Yes Transformer architecture and training details: To compare the conditionally permutation-invariant architecture with the standard causal architecture, we use a decoder-only transformer with the following parameters. Both architectures share the same parameters, differing only in their masking schemes. The model parameters are as follows: Model dimension: 64; Feedforward dimension: 256; Number of attention heads: 4; Number of transformer layers: 4; Dropout: 0.1; Activation function: GELU. For embedding (x, y), we use a neural network with two layers of sizes [256, 64]. Additionally, a final linear layer is used to predict the mean $\mu$ and standard deviation $\sigma$ of the output distribution, modeled as $Y \sim N(\mu, \sigma^2)$. For training the transformers, we use the Adam optimizer with default parameters, and the learning rate is adjusted using a cosine scheduler. The training parameters are as follows: warmup ratio is 0.03, minimum learning rate is 3.0e-5, learning rate is 0.0003, weight decay is 0.01, and batch size is 64. For all the experiments we train the transformer for 400 epochs.
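
The synthetic data-generating process quoted in the Open Datasets row can be sketched roughly as follows. This is an illustrative, standard-library-only sketch and not the authors' code (their repository is linked in the Open Source Code row). The function names, the scalar input dimension, the `jitter` term added for numerical stability, and the hand-written Cholesky routine are all assumptions made for this example; only the zero mean, the RBF kernel, the input range, and the values of σ_f, ℓ, and σ come from the quoted text.

```python
import math
import random


def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    """RBF covariance: sigma_f^2 * exp(-(x1 - x2)^2 / (2 * length_scale^2))."""
    return sigma_f ** 2 * math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_gp_dataset(n_points, sigma_f=1.0, length_scale=1.0, noise_std=0.1,
                      low=-2.0, high=2.0, jitter=1e-6, seed=None):
    """Draw one noisy dataset (xs, ys) from a zero-mean GP prior.

    Assumptions for this sketch: inputs are scalars, and `jitter` is added to
    the diagonal purely to keep the Cholesky factorization stable.
    """
    rng = random.Random(seed)
    # Inputs drawn i.i.d. from U[low, high].
    xs = [rng.uniform(low, high) for _ in range(n_points)]
    # Covariance matrix K(X, X) under the RBF kernel.
    cov = [[rbf_kernel(a, b, sigma_f, length_scale) for b in xs] for a in xs]
    for i in range(n_points):
        cov[i][i] += jitter
    lower = cholesky(cov)
    # f = L z with z ~ N(0, I) has covariance L L^T = K; the mean m(X) is 0.
    z = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    f = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n_points)]
    # Observation noise N(0, noise_std^2) is added to the latent function values.
    ys = [fi + rng.gauss(0.0, noise_std) for fi in f]
    return xs, ys


if __name__ == "__main__":
    xs, ys = sample_gp_dataset(5, seed=0)
    for x, y in zip(xs, ys):
        print(f"x={x:+.3f}  y={y:+.3f}")
```

In practice a numerical library would replace the hand-written Cholesky step; it is spelled out here only so the sketch runs without dependencies. Each call with a different seed yields a new function draw, which matches the row's description of datasets sampled from a GP prior.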