Large Language Models Are Zero-Shot Time Series Forecasters

Authors: Nate Gruver, Marc Finzi, Shikai Qiu, Andrew G. Wilson

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the zero-shot forecasting ability of LLMs by comparing LLMTIME with GPT-3 and LLaMA-2 70B to many popular time series baselines on a variety of benchmark time series datasets.
Researcher Affiliation | Collaboration | Acknowledgements. We thank Micah Goldblum, Greg Benton, and Wesley Maddox for helpful discussions. This work is supported by NSF CAREER IIS-2145492, NSF I-DISRE 193471, NSF IIS-1910266, BigHat Biosciences, Capital One, and an Amazon Research Award.
Pseudocode | No | The paper describes the LLMTIME method in text (Section 3) and uses diagrams (e.g., Figure 1, Figure 3), but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | At its core, this method represents the time series as a string of numerical digits, and views time series forecasting as next-token prediction in text, unlocking... Code: https://github.com/ngruver/llmtime (an illustrative sketch of this encoding appears after the table).
Open Datasets | Yes | We use three benchmark datasets that are common within deep learning research and many baseline methods that accompany the benchmark datasets. Darts [23]: A collection of 8 real univariate time series datasets. Monash [18]: The Monash forecasting archive contains 30 publicly available datasets... Informer [54]: We evaluated on multivariate datasets widely used for benchmarking efficient transformer models [16, 54]. (A loading example appears after the table.)
Dataset Splits | Yes | We also experiment with an offset β calculated as a percentile of the input data, and we tune these two parameters on validation log likelihoods (details in Appendix A).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running its experiments. It mentions using LLMs like GPT-3 and LLaMA-2, and the 'cost of API queries' for GPT-3, implying reliance on external services, but without specifying the underlying hardware.
Software Dependencies | No | The paper mentions various models and libraries (e.g., Darts, PySR, Autoformer, FEDformer codebases) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | To avoid wasting tokens when the inputs are very large, we scale values down so that the α-percentile of rescaled time series values is 1. We also experiment with an offset β calculated as a percentile of the input data, and we tune these two parameters on validation log likelihoods (details in Appendix A). To control sampling, we use temperature scaling, logit bias, and nucleus sampling (Appendix C). The full set of hyperparameters used for LLMTIME and the baseline methods are detailed in Appendix C.1. (A sketch of this rescaling appears after the table.)
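
To make the Open Source Code row concrete, here is a minimal Python sketch of the digit encoding the paper describes (function names and defaults are my own; the authors' implementation lives at https://github.com/ngruver/llmtime). Values are rendered at fixed precision with the decimal point dropped, digits separated by spaces so GPT-3's BPE tokenizer assigns one token per digit, and timesteps separated by commas.

# Illustrative sketch, not the official LLMTIME code.
def encode_series(values, prec=2, digit_sep=" ", step_sep=" , "):
    """Render a numeric series as a digit string for next-token prediction.

    The decimal point is dropped: with fixed precision it carries no
    information. Digits are space-separated for GPT-3-style tokenizers.
    """
    tokens = []
    for v in values:
        digits = str(int(round(abs(v) * 10 ** prec)))  # fixed precision, point removed
        sign = "-" if v < 0 else ""
        tokens.append(sign + digit_sep.join(digits))
    return step_sep.join(tokens)

def decode_series(text, prec=2, step_sep=" , "):
    """Invert encode_series: strip digit separators and restore the scale."""
    return [float(chunk.replace(" ", "")) / 10 ** prec
            for chunk in text.split(step_sep)]

print(encode_series([0.64, 12.5, -3.0]))  # "6 4 , 1 2 5 0 , -3 0 0"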
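As a usage note for the Open Datasets row, the Darts series quoted above are publicly loadable through the darts Python library; AirPassengers is one of the eight univariate datasets. The 80/20 split below is only an example, not necessarily the split used in the paper.

from darts.datasets import AirPassengersDataset

series = AirPassengersDataset().load()   # returns a darts TimeSeries
train, test = series.split_before(0.8)   # example holdout split
print(len(train), len(test))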
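For the Experiment Setup row, a minimal sketch of the quoted α/β rescaling, assuming α and β are expressed as fractions; the helper name and defaults are hypothetical, and the paper's Appendix A gives the authors' exact procedure and the validation-based tuning of both parameters.

import numpy as np

def rescale(history, alpha=0.99, beta=0.3):
    # Offset beta: a percentile of the input history (hypothetical default).
    offset = np.percentile(history, beta * 100)
    # Scale so the alpha-percentile of the rescaled values is 1.
    scale = np.percentile(history - offset, alpha * 100)
    scale = scale if scale != 0 else 1.0
    return (history - offset) / scale, offset, scale

history = np.array([12.0, 15.0, 40.0, 90.0, 300.0])
scaled, offset, scale = rescale(history)
# Forecasts decoded from the LLM are mapped back via: y * scale + offset.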