Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Predictive Querying for Autoregressive Neural Sequence Models
Authors: Alex Boyd, Samuel Showalter, Stephan Mandt, Padhraic Smyth
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods. We evaluate these methods across three user behavior datasets and two language datasets. |
| Researcher Affiliation | Academia | 1Department of Statistics 2Department of Computer Science University of California, Irvine EMAIL |
| Pseudocode | No | The paper describes methods and formulas but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Code for this work is available at https://github.com/ajboyd2/prob_seq_queries. |
| Open Datasets | Yes | Reviews contains sequences of Amazon customer reviews for products belonging to one of V = 29 categories [Ni et al., 2019]; Mobile Apps consists of app usage records over V = 88 unique applications [Aliannejadi et al., 2021]; MOOCs consists of student interaction with online course resources over V = 98 actions [Kumar et al., 2019]. We also use the works of William Shakespeare [Shakespeare] by modeling the occurrence of V = 67 unique ASCII characters. Lastly, we examine WikiText [Merity et al., 2017] to explore word-level sequence modeling applications with GPT-2, a large-scale language model with a vocabulary of V = 50257 word-pieces [Radford et al., 2019, Wu et al., 2016]. |
| Dataset Splits | No | The paper states: 'For all datasets except WikiText, we train Long-short Term Memory (LSTM) networks until convergence.' and mentions 'test split' but does not explicitly provide details about a validation dataset split or percentage in the main text. |
| Hardware Specification | Yes | Model training and experimentation utilized roughly 200 NVIDIA GeForce 2080ti GPU hours. |
| Software Dependencies | No | The paper mentions using 'GPT-2 with pre-trained weights from Hugging Face [Wolf et al., 2020]' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | To ensure an even comparison of query estimators, we fix the computation budget per query in terms of model calls $f_\theta(h_k)$ to be equal across all 3 methods, repeating experiments for different budget magnitudes roughly corresponding to $O(10)$, $O(10^2)$, $O(10^3)$ model calls (see Appendix H for full details). and For each query history and method, we compute the hitting time query estimate $p_\theta(\tau(a) = K)$ over $K = 3, \ldots, 11$, with $a$ determined by the $K$th symbol of the ground truth sequence. and To further investigate the effect of entropy, we alter each model by applying a temperature $T > 0$ to every conditional factor: $p_{\theta,T}(X_k \mid X_{<k}) \propto p_\theta(X_k \mid X_{<k})^{1/T}$, effectively changing the entropy ranges for the models. |
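
The temperature adjustment quoted in the Experiment Setup row raises each conditional probability to the power 1/T and renormalizes. The snippet below is a minimal sketch of that operation on a single categorical distribution, not the authors' implementation; the function names `apply_temperature` and `entropy` and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Return the distribution proportional to probs[i] ** (1 / temperature).

    Hypothetical helper for illustration: `probs` is one conditional
    distribution over the vocabulary, given as a list of probabilities.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space so very small probabilities do not underflow.
    logs = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logs)
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability symbols."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    base = [0.7, 0.2, 0.1]  # made-up example distribution
    for t in (0.5, 1.0, 2.0):
        scaled = apply_temperature(base, t)
        print(t, [round(p, 4) for p in scaled], round(entropy(scaled), 4))
```

With T > 1 the distribution flattens and its entropy rises; with T < 1 it sharpens and its entropy falls, which is how a single trained model can be made to cover a range of entropies.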