Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

Authors: Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
Researcher Affiliation | Collaboration | 1Georgia Institute of Technology 2AWS. Correspondence to: Jiecheng Lu <EMAIL>, Shihao Yang <EMAIL>.
Pseudocode | No | The paper describes methods in prose and mathematical notation but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code implementation is available at the following link.
Open Datasets | Yes | Our main MTSF experiments are conducted on 12 widely-used real-world time series datasets. These datasets are summarized as follows: Weather dataset (Wu et al., 2021) ... Solar dataset (Lai et al., 2018) ... Electricity dataset (Wu et al., 2021) ... ETT dataset (Zhou et al., 2021) ... Traffic dataset (Wu et al., 2021) ... PEMS dataset (Li et al., 2017).
Dataset Splits | Yes | We use the same train-validation-test splitting ratio as in previous studies by Zeng et al. (2023); Nie et al. (2022); Liu et al. (2024b). We also follow the same dataset standardization methods used in these studies.
Hardware Specification | Yes | All training tasks in this paper can be conducted on a single Nvidia RTX 4090 GPU.
Software Dependencies | No | The paper mentions software components such as the 'AdamW optimizer', 'LayerNorm', and 'RMSNorm' but does not provide specific version numbers for these or any other key software dependencies.
Experiment Setup | Yes | For the hyper-parameter settings of the pure AR/WAVE Transformer, we use m = 3 Transformer layers, 8 heads, and set the hidden dimension d based on the number of series C, using the empirical formula d = 16 * C. We use 4d as the hidden dimension for the feedforward MLP in the Transformer layer. A dropout rate of 0.1 is applied to both the AR term and the MA term. We initialize the weights of all linear layers and embedding layers using the GPT-2 weight initialization method: a normal distribution with a standard deviation of 0.02. For the output projection layers in the attention and MLP, we additionally scale the standard deviation by a factor of 1/√m. The batch size is set to 32. During training, pure AR/WAVE Transformers are trained with the next-step prediction objective and MSE loss. We use the AdamW optimizer with betas = (0.9, 0.95) and weight decay = 0.1. We evaluate the validation and test losses at the end of each epoch, with early-stopping patience set to 12 epochs and a maximum of 100 training epochs. We apply a linear warm-up for the learning rate, increasing it from 0.00006 to 0.0006 over the first 5 epochs, and gradually decrease it in the subsequent epochs.
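The quoted setup bundles several numeric conventions (the empirical d = 16 * C rule, the GPT-2-style init with 1/√m scaling for output projections, and the warm-up learning-rate schedule). A minimal Python sketch of those formulas is shown below; it is an illustration, not the authors' code, and the shape of the post-warm-up decay (linear back toward the starting rate) is an assumption, since the text only says the rate is "gradually decreasing".

```python
import math

def hidden_dim(num_series: int) -> int:
    """Empirical hidden-dimension rule d = 16 * C from the setup above."""
    return 16 * num_series

def init_std(is_output_proj: bool, m: int = 3, base_std: float = 0.02) -> float:
    """GPT-2-style init: N(0, 0.02) for linear/embedding weights;
    output projections in attention/MLP are scaled by 1/sqrt(m)."""
    return base_std / math.sqrt(m) if is_output_proj else base_std

def learning_rate(epoch: int, warmup: int = 5, max_epochs: int = 100,
                  lr_start: float = 0.00006, lr_peak: float = 0.0006) -> float:
    """Linear warm-up from lr_start to lr_peak over the first `warmup` epochs
    (epoch is 0-indexed), then a gradual decrease; the linear decay used here
    is an assumed placeholder for the unspecified schedule."""
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / (warmup - 1)
    frac = (epoch - warmup + 1) / (max_epochs - warmup)
    return lr_peak - (lr_peak - lr_start) * min(frac, 1.0)
```

For example, a dataset with C = 21 series (as in Weather) would give d = 336 under this rule, with the feedforward MLP sized at 4d = 1344.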