Are Self-Attentions Effective for Time Series Forecasting?
Authors: Dongbin Kim, Jinseong Park, Jaewook Lee, Hoki Kim
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various datasets demonstrate that our model achieves superior performance with the lowest mean squared error and uses fewer parameters compared to existing models. The implementation of our model is available at: https://github.com/dongbeank/CATS. |
| Researcher Affiliation | Academia | ¹Seoul National University, ²Chung-Ang University; {dongbin413,jinseong,jaewook}@snu.ac.kr, hokikim@cau.ac.kr |
| Pseudocode | No | The paper describes the model architecture and components in detail through text and diagrams (Figure 4) but does not include a formal pseudocode block or algorithm (an illustrative cross-attention sketch is given after this table). |
| Open Source Code | Yes | The implementation of our model is available at: https://github.com/dongbeank/CATS. |
| Open Datasets | Yes | To this end, we use 7 different real-world datasets and 9 baseline models. For datasets, we use Electricity, ETT (ETTh1, ETTh2, ETTm1, and ETTm2), Weather, Traffic, and M4. These datasets are provided in [23] and [24] for time series forecasting benchmark, detailed in Appendix. |
| Dataset Splits | Yes | For the forecasting horizon T, we also used the widely accepted values, i.e., [96, 192, 336, 720]. In all configurations, we adopt the GeGLU activation function [16] between the two linear layers in the feed-forward network of our model. Additionally, we use learnable positional embedding parameters for the input data and omit positional embeddings for learnable queries to avoid redundant parameter learning. For the experiments summarized in Table 4 and Table 11, our model uses three cross-attention layers with embedding size D = 256 and H = 32 attention heads. Specifically, to avoid overfitting on small datasets [14], we use patch length 48 on the ETTh1 and ETTh2 datasets. Further details on the hyperparameter settings for these experiments are provided in Table 9. (A GeGLU feed-forward sketch follows this table.) |
| Hardware Specification | Yes | We used 4 NVIDIA RTX 4090 24GB GPUs with 2 Intel(R) Xeon(R) Gold 5218R CPUs @ 2.10GHz for all experiments. |
| Software Dependencies | No | The paper mentions using a 'GeGLU activation function' but does not specify version numbers for general software dependencies like Python, PyTorch, TensorFlow, or CUDA libraries. |
| Experiment Setup | Yes | For the forecasting horizon T, we also used the widely accepted values, i.e., [96, 192, 336, 720]. In all configurations, we adopt the GeGLU activation function [16] between the two linear layers in the feed-forward network of our model. Additionally, we use learnable positional embedding parameters for the input data and omit positional embeddings for learnable queries to avoid redundant parameter learning. ... Further details on the hyperparameter settings for these experiments are provided in Table 9. (Table 9 specifies Layers, Embedding Size, Query Sharing, Input Sequence Length, Batch Size, Epoch, and Learning Rate; an illustrative configuration sketch follows this table.) |
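
The rows above describe a cross-attention-only architecture in which learnable queries attend to embedded input patches, with positional embeddings applied to the inputs but not to the queries. The following is a minimal PyTorch sketch of that attention pattern, assuming standard `nn.MultiheadAttention` and hypothetical shapes (`num_patches`, `horizon_queries`); it is not the authors' implementation, which is available at https://github.com/dongbeank/CATS.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Hypothetical cross-attention block: learnable queries attend to input patches."""
    def __init__(self, num_patches, d_model=256, n_heads=32, horizon_queries=8):
        super().__init__()
        # Learnable queries (no positional embedding, per the quoted setup).
        self.queries = nn.Parameter(torch.randn(horizon_queries, d_model))
        # Learnable positional embedding for the input patch embeddings.
        self.pos_emb = nn.Parameter(torch.randn(num_patches, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patch_emb):               # patch_emb: (batch, num_patches, d_model)
        kv = patch_emb + self.pos_emb            # positional embedding on inputs only
        q = self.queries.expand(patch_emb.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)            # cross-attention, no self-attention
        return out                               # (batch, horizon_queries, d_model)

# Example with hypothetical shapes: batch of 8 series, 16 patches, D = 256.
x = torch.randn(8, 16, 256)
print(CrossAttentionBlock(num_patches=16)(x).shape)  # torch.Size([8, 8, 256])
```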
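The quoted setup adopts the GeGLU activation [16] between the two linear layers of the feed-forward network. A minimal sketch of such a block is shown below, assuming a hypothetical hidden width `d_ff`; the exact widths and layer placement in CATS may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Feed-forward block with a GeGLU gate (Shazeer, 2020): GeGLU(x) = GELU(xW) * (xV)."""
    def __init__(self, d_model=256, d_ff=512):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # value and gate paths in one projection
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))   # gate the value path, then project back
```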
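For reference, the hyperparameters quoted in the table can be collected into a single configuration. Values not stated in the excerpts are left as placeholders rather than guessed; the full settings are in the paper's Table 9.

```python
# Illustrative configuration assembled from the excerpts above; None marks
# values the excerpts do not state (see the paper's Table 9 for the full settings).
config = {
    "horizons": [96, 192, 336, 720],  # forecasting horizons T
    "layers": 3,                      # cross-attention layers (Tables 4 and 11 setting)
    "d_model": 256,                   # embedding size D
    "n_heads": 32,                    # attention heads H
    "activation": "GeGLU",            # feed-forward activation [16]
    "patch_len_etth": 48,             # patch length on ETTh1/ETTh2 (to avoid overfitting)
    "input_seq_len": None,            # listed in Table 9, not quoted here
    "batch_size": None,               # listed in Table 9, not quoted here
    "learning_rate": None,            # listed in Table 9, not quoted here
}
```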