Predictive Querying for Autoregressive Neural Sequence Models

Authors: Alex Boyd, Samuel Showalter, Stephan Mandt, Padhraic Smyth

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods. We evaluate these methods across three user behavior datasets and two language datasets.
Researcher Affiliation | Academia | Department of Statistics and Department of Computer Science, University of California, Irvine; {alexjb,showalte,mandt,p.smyth}@uci.edu
Pseudocode | No | The paper describes methods and formulas but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Code for this work is available at https://github.com/ajboyd2/prob_seq_queries.
Open Datasets | Yes | Reviews contains sequences of Amazon customer reviews for products belonging to one of V = 29 categories [Ni et al., 2019]; Mobile Apps consists of app usage records over V = 88 unique applications [Aliannejadi et al., 2021]; MOOCs consists of student interactions with online course resources over V = 98 actions [Kumar et al., 2019]. We also use the works of William Shakespeare [Shakespeare], modeling the occurrence of V = 67 unique ASCII characters. Lastly, we examine Wiki Text [Merity et al., 2017] to explore word-level sequence modeling with GPT-2, a large-scale language model with a vocabulary of V = 50257 word-pieces [Radford et al., 2019, Wu et al., 2016].
Dataset Splits | No | The paper states 'For all datasets except Wiki Text, we train Long-short Term Memory (LSTM) networks until convergence.' and refers to a 'test split', but it does not explicitly report validation split details or split percentages in the main text.
Hardware Specification | Yes | Model training and experimentation utilized roughly 200 NVIDIA GeForce 2080 Ti GPU hours.
Software Dependencies | No | The paper mentions using 'GPT-2 with pre-trained weights from Hugging Face [Wolf et al., 2020]' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | To ensure an even comparison of query estimators, we fix the computation budget per query, in terms of model calls $f_\theta(h_k)$, to be equal across all three methods, repeating experiments for budget magnitudes roughly corresponding to $O(10)$, $O(10^2)$, and $O(10^3)$ model calls (see Appendix H for full details). For each query history and method, we compute the hitting time query estimate $p_\theta(\tau(a) = K)$ over $K = 3, \dots, 11$, with $a$ determined by the $K$th symbol of the ground-truth sequence. To further investigate the effect of entropy, we alter each model by applying a temperature $T > 0$ to every conditional factor, $p_{\theta,T}(X_k \mid X_{<k}) \propto p_\theta(X_k \mid X_{<k})^{1/T}$, effectively changing the entropy ranges of the models.
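
The setup above combines three ingredients: a fixed budget of model calls, a hitting-time query $p_\theta(\tau(a) = K)$, and temperature-scaled conditionals. The sketch below is a minimal illustration of only the naive sampling baseline under these assumptions; it is not the paper's code, and the estimators studied in the paper also include search-based and hybrid methods. The `model.encode` and `model.step` interface is a hypothetical stand-in for the LSTM / GPT-2 wrappers.

```python
import torch

@torch.no_grad()
def estimate_hitting_time(model, history, a, K, temperature=1.0, budget=1000):
    """Naive Monte Carlo estimate of p_theta(tau(a) = K | history): the
    probability that symbol `a` first appears exactly K steps after the
    observed history. `model` is a hypothetical autoregressive wrapper with
    encode(history) -> (state, last_token) and step(token, state) -> (logits, state).
    """
    num_samples = max(1, budget // K)  # each sampled path costs at most K model calls
    hits = 0
    for _ in range(num_samples):
        state, prev = model.encode(history)          # condition on the observed prefix
        first_hit = None
        for k in range(1, K + 1):
            logits, state = model.step(prev, state)  # one model call f_theta(h_k)
            # softmax(logits / T) is proportional to p_theta(. | x_<k)^(1/T),
            # i.e. the temperature-altered conditional p_{theta,T}
            probs = torch.softmax(logits / temperature, dim=-1)
            prev = torch.multinomial(probs, num_samples=1).item()
            if prev == a:
                first_hit = k
                break                                # tau(a) reached before or at K
        if first_hit == K:
            hits += 1
    return hits / num_samples
```

Note that dividing the logits by $T$ before the softmax implements the quoted factor $p_{\theta,T}(X_k \mid X_{<k}) \propto p_\theta(X_k \mid X_{<k})^{1/T}$, and capping the number of sampled paths at `budget // K` is one simple way to keep the number of model calls comparable across methods, as the fixed-budget protocol requires.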