Are Language Models Actually Useful for Time Series Forecasting?

Authors: Mingtian Tan, Mike Merrill, Vinayak Gupta, Tim Althoff, Tom Hartvigsen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In a series of ablation studies on three recent and popular LLM-based time series forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade forecasting performance; in most cases, the results even improve. We also find that, despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential dependencies in time series, and do not assist in few-shot settings. Additionally, we explore time series encoders and find that patching and attention structures perform similarly to LLM-based forecasters.
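The ablations described above swap the LLM backbone for a lightweight block operating on patch embeddings. As an illustrative sketch only (not the authors' code; function names, dimensions, and the random weights are made up for this example), patching a series and passing the patches through a single basic self-attention layer looks roughly like:

```python
import numpy as np

def patchify(series, patch_len, stride):
    """Split a 1-D series into overlapping patches (PatchTST-style encoding)."""
    n = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len] for i in range(n)])

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def basic_attention(x, rng):
    """Single-head self-attention over patch tokens: the kind of simple
    block the ablations substitute for the pretrained LLM."""
    d = x.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(d))  # (num_patches, num_patches)
    return weights @ v

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 10, 96))      # toy univariate series, length 96
patches = patchify(series, patch_len=16, stride=8)
out = basic_attention(patches, rng)
print(patches.shape, out.shape)              # (11, 16) (11, 16)
```

In the actual methods a linear head then maps these token representations to the forecast horizon; the point of the ablation is that this small attention block, trained from scratch, matches the LLM backbone.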
Researcher Affiliation Academia Mingtian Tan University of Virginia wtd3gz@virginia.edu Mike A. Merrill University of Washington mikeam@cs.washington.edu Vinayak Gupta University of Washington vinayak@cs.washington.edu Tim Althoff University of Washington althoff@cs.washington.edu Thomas Hartvigsen University of Virginia hartvigsen@virginia.edu
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes All resources needed to reproduce our work are available: https://github.com/BennyTMT/LLMsForTimeSeries
Open Datasets Yes We evaluate on the following real-world datasets:
(1) ETT [21]: encompasses seven factors related to electricity transformers across four subsets: ETTh1 and ETTh2, with hourly recordings, and ETTm1 and ETTm2, with recordings every 15 minutes.
(2) Illness [40]: weekly recorded influenza illness among patients from the Centers for Disease Control, describing the ratio of patients seen with influenza-like illness to the total number of patients.
(3) Weather [40]: local climate data from 1,600 U.S. locations between 2010 and 2013; each data point consists of 11 climate features.
(4) Traffic [40]: an hourly dataset from the California Department of Transportation, consisting of road occupancy rates measured on San Francisco Bay Area freeways.
(5) Electricity [35]: the hourly electricity consumption of 321 customers from 2012 to 2014.
(6) Exchange Rate [18]: collected between 1990 and 2016, it contains daily exchange rates for the currencies of eight countries (Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore).
(7) Covid Deaths [13]: daily statistics of COVID-19 deaths in 266 countries and states between January and August 2020.
(8) Taxi (30 min) [1]: taxi rides from 1,214 locations in New York City between January 2015 and January 2016, collected every 30 minutes, with an average of 1,478 samples.
(9) NN5 (Daily) [13]: daily cash withdrawal data from 111 ATMs in the UK, with each ATM having 791 data points.
(10) FRED-MD [13]: 107 monthly macroeconomic indices released by the Federal Reserve Bank since 01/01/1959, extracted from the FRED-MD database.
The train-val-test split for the ETT datasets is 60%-20%-20%, and for the Illness, Weather, and Electricity datasets it is 70%-10%-20%. The statistics for all datasets are given in Table 1. We highlight that these datasets, with the same splits and sizes, have been extensively used to evaluate the time-series forecasting ability of LLM-based and other neural models for time-series data [48, 50, 4, 15, 5, 46, 40, 49].
Dataset Splits Yes The train-val-test split for ETT datasets is 60%-20%-20%, and for Illness, Weather, and Electricity datasets is 70%-10%-20% respectively.
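These splits are chronological (the benchmarks are never shuffled). As a minimal sketch of how the stated percentages translate into index boundaries (a hypothetical helper for illustration, not taken from the paper's codebase, which may handle boundary rounding differently):

```python
def chrono_split(n, train_frac, val_frac):
    """Return (start, end) index pairs for a chronological
    train/val/test split of a series of length n."""
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return (0, train_end), (train_end, val_end), (val_end, n)

# ETT datasets: 60%-20%-20%
print(chrono_split(1000, 0.6, 0.2))  # ((0, 600), (600, 800), (800, 1000))
# Illness, Weather, Electricity: 70%-10%-20%
print(chrono_split(1000, 0.7, 0.1))  # ((0, 700), (700, 800), (800, 1000))
```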
Hardware Specification Yes For Time-LLM [15], applying LLaMA-7B [34], we use an NVIDIA A100 GPU with 80GB memory. For other methods [50, 22], applying GPT-2 [29], we use an NVIDIA RTX A6000 GPU with 48GB memory.
Software Dependencies No The paper mentions GPT-2 and LLaMA-7B as base models but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch, TensorFlow, scikit-learn versions).
Experiment Setup Yes When reproducing the reference methods, we used each original repository's hyper-parameters and model structures. In the ablation study, because the ablated models have fewer parameters, we adjusted the learning rate or increased the batch size in some cases. All other training details remained identical to the reference methods.