DAM: Towards a Foundation Model for Forecasting

Authors: Luke Nicholas Darlow, Qiwen Deng, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Artjom Joosen, Adam Barker, Amos Storkey

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that a single univariate DAM, trained on 25 time series datasets, either outperformed or closely matched existing SoTA models at multivariate long-term forecasting across 18 datasets, including 8 held out for zero-shot transfer, even though these models were trained to specialise for each dataset-horizon combination. This single DAM excels at zero-shot transfer and very-long-term forecasting, performs well at imputation, is interpretable via basis function composition and attention, can be tuned for different inference-cost requirements, and is robust to missing and irregularly sampled data by design.
Researcher Affiliation | Collaboration | Luke Darlow, Qiwen Deng, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Artjom Joosen, Adam Barker: Systems Infrastructure Research, Edinburgh Research Centre, Central Software Institute, Huawei, Edinburgh, UK (sirlab@huawei.com). Amos Storkey: School of Informatics, University of Edinburgh, Edinburgh, UK (a.storkey@ed.ac.uk).
Pseudocode | Yes | We included a simplified version of the PyTorch code for the DAM in Appendix B and for initialising basis coefficients in Appendix E. We also provided this as supplementary material for ease of use. Listing 1: Working PyTorch (Paszke et al., 2019) code for the DAM architecture. Listing 2: PyTorch (Paszke et al., 2019) code for θ0 initialisation. (An illustrative, hedged sketch of basis-coefficient initialisation follows the table below.)
Open Source Code | Yes | We included a simplified version of the PyTorch code for the DAM in Appendix B and for initialising basis coefficients in Appendix E. We also provided this as supplementary material for ease of use.
Open Datasets | Yes | We used a total of 33 datasets for training and evaluation. We augment 10 commonly used datasets (following e.g. Wu et al., 2021; Zhou et al., 2021; Liu et al., 2021) that we split into train/valid/test, with another 15 datasets that are additionally used to enhance training (details in Appendix H). The 10 datasets are used to test within-dataset generalisation (Section 4.1) and they are: ETTh1, h2, m1, and m2; ECL; Traffic; Weather; USWeather; Exchange; and Wind. In Section 4.2 we test outwith-dataset generalisation on 8 held-out datasets, namely: Illness, Weekdays, UCIPower, Azure, MTemp, MWeb, MWeather, and MMeters. Appendices H.1 and H.2 contain tables listing these datasets with specific details and sources/citations, indicating public availability.
Dataset Splits | Yes | We used a total of 33 datasets for training and evaluation. We augment 10 commonly used datasets (following e.g. Wu et al., 2021; Zhou et al., 2021; Liu et al., 2021) that we split into train/valid/test, with another 15 datasets that are additionally used to enhance training (details in Appendix H). Table 5 (excerpt): Dataset details. Res. is the dataset resolution, or sampling rate; the listed horizons are those that are commonly tested on; Num lists the number of variables (i.e., columns) in each dataset. Example row: ETTm1, ETTm2 (Res. 15 mins; Horizons [96, 192, 336, 720]; Train/valid/test [34465, 11521, 11521]; Num 7; Domain Electricity). (An illustrative chronological-split sketch follows the table below.)
Hardware Specification | Yes | On an NVIDIA A40 GPU they take between 20 minutes and 1.5 hours.
Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' in the context of its code listings, but does not provide specific version numbers for PyTorch or other software libraries required for reproducibility.
Experiment Setup | Yes | Model hyperparameters. The following model hyperparameters were used for the DAM in this paper: Model width, d_model, of 256. Feed-forward internal width, d_ff, of 256. 4 MHSA and cross-attention heads. A ToMe reduction target of 250 TV-tokens. Dropout of 0.1. Time units of 1 day, such that δt = 1 denotes one day from now and δt = −1 is one day into the past. Training hyperparameters. The following training hyperparameters were used for the DAM in this paper: Minibatch size of 32. ... Two-phase learning: (a) 1,000,000 iterations with 10,000 warmup steps followed by cosine annealing from 1e-3 to 1e-14; and (b) an additional 50,000 iterations with 2,000 warmup steps followed by cosine annealing from 1e-3 to 0. Gradient clipping to the 90th percentile of the latest 1000 gradients of all model weights. HSR context size and σ of 540 and 720, respectively. 540 target points (also sampled from the HSR) over which to compute the loss. Random seed of 42 (the answer). (Illustrative configuration and gradient-clipping sketches follow the table.)
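Regarding the Pseudocode row: the actual listings are in the paper's Appendices B and E. As a rough, hypothetical illustration of what basis-coefficient initialisation can look like (this is not the paper's Listing 2; the function name, sinusoidal basis, and least-squares approach are all assumptions), one could fit coefficients to a context window as follows:

```python
import torch

def init_theta0(context: torch.Tensor, times: torch.Tensor, periods: torch.Tensor) -> torch.Tensor:
    """Fit sine/cosine basis coefficients to a context window by least squares.

    context: (N,) observed values; times: (N,) timestamps (e.g. in days);
    periods: (B,) basis periods. Returns (2B,) coefficients [sin | cos].
    Hypothetical helper for illustration only, not the paper's Listing 2.
    """
    angles = 2 * torch.pi * times[:, None] / periods[None, :]          # (N, B)
    design = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # (N, 2B)
    solution = torch.linalg.lstsq(design, context[:, None]).solution   # (2B, 1)
    return solution.squeeze(1)
```

For instance, with times measured in days, `init_theta0(x, t, torch.tensor([1.0, 7.0, 365.25]))` would fit daily, weekly, and yearly components of a series.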
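Regarding the Dataset Splits row: the ETTm1/ETTm2 split sizes correspond to consecutive blocks of the series. The sketch below shows one plausible chronological split using those sizes; the helper is illustrative, and the authors' loaders may handle boundaries (e.g. context overlap between splits) differently.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, sizes=(34465, 11521, 11521)):
    """Split a time-ordered dataframe into consecutive train/valid/test blocks.

    Default sizes follow the ETTm1/ETTm2 row of Table 5; other datasets have
    their own lengths. Illustrative sketch, not the authors' exact loader.
    """
    n_train, n_valid, n_test = sizes
    train = df.iloc[:n_train]
    valid = df.iloc[n_train:n_train + n_valid]
    test = df.iloc[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test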
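Regarding the Experiment Setup row: gathered into one place, the quoted hyperparameters might be organised as a configuration object like the sketch below. The dataclass and its field names are assumptions made for readability; the values are copied from the row above.

```python
from dataclasses import dataclass

@dataclass
class DAMConfig:
    """Hyperparameters quoted above; structure and names are illustrative."""
    # Model
    d_model: int = 256                 # model width
    d_ff: int = 256                    # feed-forward internal width
    n_heads: int = 4                   # MHSA and cross-attention heads
    tome_target_tv_tokens: int = 250   # ToMe reduction target
    dropout: float = 0.1
    time_unit_days: float = 1.0        # delta_t = 1 is one day ahead; -1 is one day back
    # Training
    batch_size: int = 32
    phase_a_iterations: int = 1_000_000
    phase_a_warmup_steps: int = 10_000
    phase_b_iterations: int = 50_000
    phase_b_warmup_steps: int = 2_000
    grad_clip_percentile: float = 90.0  # over the latest 1000 gradient norms
    hsr_context_size: int = 540
    hsr_sigma: int = 720
    n_target_points: int = 540
    seed: int = 42
```

The percentile-based gradient clipping could be implemented along these lines; the class below is one interpretation (clipping the global gradient norm to the 90th percentile of the most recent 1000 norms), not necessarily the authors' exact rule.

```python
from collections import deque
import torch

class PercentileGradClipper:
    """Clip the global gradient norm to a running percentile of recent norms."""

    def __init__(self, parameters, percentile=90.0, history=1000):
        self.parameters = [p for p in parameters]
        self.percentile = percentile
        self.norms = deque(maxlen=history)

    def clip(self):
        # Compute the current global gradient norm over all parameters.
        grads = [p.grad.detach().norm() for p in self.parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack(grads))
        self.norms.append(total_norm.item())
        # Clip to the chosen percentile of the recorded history.
        threshold = torch.quantile(torch.tensor(list(self.norms)), self.percentile / 100.0)
        torch.nn.utils.clip_grad_norm_(self.parameters, max_norm=threshold.item())
        return total_norm
```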