Predictive Querying for Autoregressive Neural Sequence Models
Authors: Alex Boyd, Samuel Showalter, Stephan Mandt, Padhraic Smyth
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods. We evaluate these methods across three user behavior datasets and two language datasets. |
| Researcher Affiliation | Academia | 1Department of Statistics 2Department of Computer Science University of California, Irvine {alexjb,showalte,mandt,p.smyth}@uci.edu |
| Pseudocode | No | The paper describes methods and formulas but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Code for this work is available at https://github.com/ajboyd2/prob_seq_queries. |
| Open Datasets | Yes | Reviews contains sequences of Amazon customer reviews for products belonging to one of V = 29 categories [Ni et al., 2019]; Mobile Apps consists of app usage records over V = 88 unique applications [Aliannejadi et al., 2021]; MOOCs consists of student interaction with online course resources over V = 98 actions [Kumar et al., 2019]. We also use the works of William Shakespeare [Shakespeare] by modeling the occurrence of V = 67 unique ASCII characters. Lastly, we examine Wiki Text [Merity et al., 2017] to explore word-level sequence modeling applications with GPT-2, a large-scale language model with a vocabulary of V = 50257 word-pieces [Radford et al., 2019, Wu et al., 2016]. |
| Dataset Splits | No | The paper states: 'For all datasets except Wiki Text, we train Long-short Term Memory (LSTM) networks until convergence.' and mentions a 'test split', but it does not explicitly specify validation-split details or train/validation/test percentages in the main text. |
| Hardware Specification | Yes | Model training and experimentation utilized roughly 200 NVIDIA GeForce 2080 Ti GPU hours. |
| Software Dependencies | No | The paper mentions using 'GPT-2 with pre-trained weights from Hugging Face [Wolf et al., 2020]' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | To ensure an even comparison of query estimators, we fix the computation budget per query in terms of model calls f_θ(h_k) to be equal across all 3 methods, repeating experiments for different budget magnitudes roughly corresponding to O(10), O(10^2), O(10^3) model calls (see Appendix H for full details). and For each query history and method, we compute the hitting time query estimate p_θ(τ(a) = K) over K = 3, . . . , 11, with a determined by the Kth symbol of the ground truth sequence. and To further investigate the effect of entropy, we alter each model by applying a temperature T > 0 to every conditional factor: p_{θ,T}(X_k \| X_{<k}) ∝ p_θ(X_k \| X_{<k})^{1/T}, effectively changing the entropy ranges for the models. |
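The temperature-scaled conditionals and the hitting-time query from the setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's estimator: `apply_temperature` implements p_{θ,T} ∝ p_θ^{1/T} via logit scaling, and `hitting_time_estimate` is a naive Monte Carlo (ancestral-sampling) estimate of p_θ(τ(a) = K); the function names and the `cond_logits_fn` interface are hypothetical stand-ins for the model's conditional factor f_θ(h_k).

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_temperature(logits, T):
    """Temperature-scaled conditional p_{theta,T} propto p_theta^{1/T}.

    Dividing the logits by T before the softmax is equivalent to raising
    the probabilities to the power 1/T and renormalizing.
    """
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def hitting_time_estimate(cond_logits_fn, history, a, K, T=1.0, n_samples=1000):
    """Naive Monte Carlo estimate of p(tau(a) = K): the probability that
    symbol `a` first occurs exactly K steps after the given history.

    cond_logits_fn(seq) -> logits over the vocabulary for the next symbol.
    """
    hits = 0
    for _ in range(n_samples):
        seq = list(history)
        first = None
        for k in range(1, K + 1):
            p = apply_temperature(cond_logits_fn(seq), T)
            x = rng.choice(len(p), p=p)   # ancestral sampling, one step
            seq.append(x)
            if x == a and first is None:
                first = k
        if first == K:                    # symbol a first appeared at step K
            hits += 1
    return hits / n_samples
```

For instance, under a toy model with constant logits (i.i.d. uniform over V symbols), the true hitting-time probability is (1 - 1/V)^{K-1} / V, which the sample estimate approaches as `n_samples` grows; the paper's search-based estimators trade this sampling noise for a bounded number of model calls per query.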