reproducibilityindex.ai

Predictive Querying for Autoregressive Neural Sequence Models

Authors: Alex Boyd, Samuel Showalter, Stephan Mandt, Padhraic Smyth

NeurIPS 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods. We evaluate these methods across three user behavior datasets and two language datasets.
Researcher Affiliation	Academia	¹Department of Statistics, ²Department of Computer Science, University of California, Irvine {alexjb,showalte,mandt,p.smyth}@uci.edu
Pseudocode	No	The paper describes methods and formulas but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code	Yes	Code for this work is available at https://github.com/ajboyd2/prob_seq_queries.
Open Datasets	Yes	Reviews contains sequences of Amazon customer reviews for products belonging to one of V = 29 categories [Ni et al., 2019]; Mobile Apps consists of app usage records over V = 88 unique applications [Aliannejadi et al., 2021]; MOOCs consists of student interaction with online course resources over V = 98 actions [Kumar et al., 2019]. We also use the works of William Shakespeare [Shakespeare] by modeling the occurrence of V = 67 unique ASCII characters. Lastly, we examine WikiText [Merity et al., 2017] to explore word-level sequence modeling applications with GPT-2, a large-scale language model with a vocabulary of V = 50257 word-pieces [Radford et al., 2019, Wu et al., 2016].
Dataset Splits	No	The paper states: 'For all datasets except WikiText, we train Long-short Term Memory (LSTM) networks until convergence.' and mentions 'test split' but does not explicitly provide details about a validation dataset split or percentage in the main text.
Hardware Specification	Yes	Model training and experimentation utilized roughly 200 NVIDIA GeForce 2080ti GPU hours.
Software Dependencies	No	The paper mentions using 'GPT-2 with pre-trained weights from Hugging Face [Wolf et al., 2020]' but does not provide specific version numbers for any software dependencies.
Experiment Setup	Yes	To ensure an even comparison of query estimators, we fix the computation budget per query in terms of model calls f_θ(h_k) to be equal across all 3 methods, repeating experiments for different budget magnitudes roughly corresponding to O(10), O(10^2), O(10^3) model calls (see Appendix H for full details). For each query history and method, we compute the hitting time query estimate p_θ(τ(a) = K) over K = 3, ..., 11, with a determined by the Kth symbol of the ground truth sequence. To further investigate the effect of entropy, we alter each model by applying a temperature T > 0 to every conditional factor: p_{θ,T}(X_k | X_{<k}) ∝ p_θ(X_k | X_{<k})^{1/T}, effectively changing the entropy ranges for the models.
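
The temperature adjustment quoted in the Experiment Setup row raises each conditional probability to the power 1/T and renormalizes. The following is a minimal sketch of that operation on a single next-symbol distribution, not the authors' implementation: the function names `apply_temperature` and `entropy` and the toy distribution `base` are illustrative assumptions and do not come from the paper or its repository.

```python
import math


def apply_temperature(probs, temperature):
    """Return the distribution proportional to probs ** (1 / temperature).

    Hypothetical helper for illustration; `probs` is one conditional
    distribution over the vocabulary and must contain a positive entry.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space for stability; zero-probability symbols stay at zero.
    logs = [math.log(p) / temperature if p > 0 else -math.inf for p in probs]
    peak = max(logs)
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability symbols."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy next-symbol distribution (assumed values, not taken from the paper).
base = [0.7, 0.2, 0.1]
sharp = apply_temperature(base, 0.5)  # T < 1 concentrates mass on likely symbols
flat = apply_temperature(base, 2.0)   # T > 1 moves the distribution toward uniform
```

In this sketch, T < 1 lowers the entropy of the distribution and T > 1 raises it, which is the mechanism the quoted passage relies on to vary the entropy range of each model. T = 1 leaves the distribution unchanged.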