Introducing Spectral Attention for Long-Range Dependency in Time Series Forecasting
Authors: Bong Gyun Kang, Dongjun Lee, HyunGi Kim, Dohyun Chung, Sungroh Yoon
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on 11 real-world time series datasets using 7 recent forecasting models, we consistently demonstrate the efficacy of our Spectral Attention mechanism, achieving state-of-the-art results. |
| Researcher Affiliation | Academia | Bong Gyun Kang1 Dongjun Lee1 HyunGi Kim2 Dohyun Chung3 Sungroh Yoon1,2 — 1 Interdisciplinary Program in Artificial Intelligence, Seoul National University; 2 Department of Electrical and Computer Engineering, Seoul National University; 3 Department of Future Automotive Mobility, Seoul National University |
| Pseudocode | Yes | Algorithm 1 Batched Spectral Attention (1 epoch) |
| Open Source Code | Yes | The full code is available at https://github.com/DJLee1208/BSA_2024. |
| Open Datasets | Yes | We use eleven real-world public datasets: Weather, Traffic, ECL, ETT (4 sub-datasets; h1, h2, m1, m2), Exchange, PEMS03, Energy Data, and Illness [6, 26, 29, 49]. All these public datasets were downloaded from the referenced sources in March 2024. |
| Dataset Splits | Yes | Train, validation, and test split ratios are 0.6, 0.2, 0.2 for the ETT dataset and 0.7, 0.1, 0.2 for the Weather, Traffic, ECL, Exchange, PEMS03, Energy Data, and Illness datasets. Model selection and hyperparameter search are conducted based on the validation set. (A chronological split sketch follows the table.) |
| Hardware Specification | Yes | Each experiment was conducted on a single NVIDIA GeForce RTX 3090 Ti, NVIDIA A40, or NVIDIA L40 GPU. |
| Software Dependencies | No | The whole code is implemented in PyTorch [38]. |
| Experiment Setup | Yes | We first train the base model for more than 30 epochs (20 epochs for the Traffic dataset) using Adam [22] to ensure that the validation MSE saturates, while also conducting an extensive hyperparameter search. The hyperparameter search space for the base model is as follows: the possible learning rate is (0.03, 0.01, 0.003, 0.001, 0.0003), and the weight decay is (0.01, 0.003, 0.001, 0.0003, 0.0001, 0.00003). The hyperparameter search space for BSA fine-tuning is as follows: the possible learning rate for the SA-Matrix in the BSA module is (0.08, 0.05, 0.03, 0.01, 0.003, 0.001); the learning rate for the rest of the model, i.e., the original modules, is (0.01, 0.003, 0.001, 0.0003, 0.0001, 0.00003, 0.00001); the learning rate for the smoothing factor αk is (none, 0.03, 0.01, 0.003, 0.001, 0.0001, 0.00001); and the initialization for the smoothing factor αk is ([0.9, 0.99, 0.999], [0.9, 0.99, 0.999, 0.999], [0.9, 0.95, 0.992, 0.999], [0.8, 0.96, 0.992, 0.9984, 0.99968]). The default batch size for baseline model saturation is 64, while for our method, which involves fine-tuning after integrating the BSA module, it is 256. We used the Adam [22] optimizer and L2 loss (MSE loss) for model optimization. (An optimizer-setup sketch follows the table.) |
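The dataset-splits row quotes chronological ratios per dataset group. Below is a minimal sketch of how such splits are typically computed; the helper name `split_series` and the placeholder array are hypothetical, and only the ratios (0.6/0.2/0.2 for ETT, 0.7/0.1/0.2 for the remaining datasets) come from the paper.

```python
import numpy as np

def split_series(data, train_ratio, val_ratio):
    """Chronologically split a time series into train/val/test segments.

    Hypothetical helper for illustration; only the split ratios are taken
    from the paper's reported setup.
    """
    n = len(data)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    return data[:train_end], data[train_end:val_end], data[val_end:]

series = np.arange(10_000)                          # placeholder series
train, val, test = split_series(series, 0.7, 0.1)   # Weather/Traffic/ECL/Exchange/PEMS03/Energy/Illness
# For the four ETT sub-datasets the call would be split_series(series, 0.6, 0.2).
```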
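The experiment-setup row describes fine-tuning with three separate learning rates (the SA-Matrix, the smoothing factors αk, and the original modules), the Adam optimizer, and an MSE loss. The sketch below shows one way to wire those parameter groups in PyTorch; the class and attribute names (`ToyBSAModel`, `sa_matrix`, `alpha`, `backbone`) are hypothetical stand-ins for the released code, and the learning-rate and weight-decay values are simply picks from the quoted search spaces.

```python
import torch
import torch.nn as nn

class ToyBSAModel(nn.Module):
    """Stand-in module; attribute names are hypothetical and exist only to
    illustrate the three parameter groups described in the setup."""
    def __init__(self, n_feat=8, n_scales=3):
        super().__init__()
        self.backbone = nn.Linear(n_feat, n_feat)                       # original forecasting model
        self.sa_matrix = nn.Parameter(torch.eye(n_scales))              # BSA SA-Matrix
        self.alpha = nn.Parameter(torch.tensor([0.9, 0.99, 0.999]))     # smoothing factors (one quoted init)

model = ToyBSAModel()
optimizer = torch.optim.Adam(
    [
        {"params": [model.sa_matrix], "lr": 0.03},                # SA-Matrix learning rate
        {"params": [model.alpha], "lr": 0.003},                   # smoothing-factor learning rate
        {"params": model.backbone.parameters(), "lr": 0.0003},    # original-module learning rate
    ],
    weight_decay=0.0003,
)
criterion = nn.MSELoss()  # L2 loss used for model optimization
```

Fine-tuning would then run a standard training loop with batch size 256 (the quoted default for the BSA fine-tuning stage), while the base-model saturation stage uses batch size 64.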