LeRet: Language-Empowered Retentive Network for Time Series Forecasting

Authors: Qihe Huang, Zhengyang Zhou, Kuo Yang, Gengyu Lin, Zhongchao Yi, Yang Wang

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations demonstrate the effectiveness of our LeRet, especially reveal superiority on few-shot and zero-shot forecasting tasks. Code is available at https://github.com/hqh0728/LeRet. The paper contains Section 4 Experiments, with 4.1 Datasets and Experimental Setups, 4.2 Main Results, and 4.3 Ablation Study.
Researcher Affiliation | Academia | (1) University of Science and Technology of China (USTC), Hefei, China; (2) Suzhou Institute for Advanced Research, USTC, Suzhou, China; (3) State Key Laboratory of Resources and Environmental Information System
Pseudocode | No | The paper describes the method using flow diagrams (Figure 2, Figure 3) and descriptive text, but provides no formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/hqh0728/LeRet.
Open Datasets | Yes | We evaluate performance of long-term forecasting on Weather, Traffic, Solar, Electricity and four ETT datasets (i.e., ETTh1, ETTh2, ETTm1, and ETTm2), which have been extensively adopted for benchmarking long-term forecasting models. For short-term forecasting, we adopt PeMS, which contains four public traffic network datasets (PEMS03, PEMS04, PEMS07, PEMS08).
Dataset Splits | No | The input time series length L is set as 336 for all baselines, and four different prediction horizons T ∈ {96, 192, 336, 720} are used. For short-term forecasting, the PeMS benchmark (PEMS03, PEMS04, PEMS07, PEMS08) is adopted, with all models following the same setup of input length L = 96 and prediction length T = 12. In few-shot learning, only 10% of the training timesteps are utilized. The paper gives input and prediction lengths and the partial training-data usage for few-shot learning, but does not explicitly state the train/validation/test splits (e.g., percentages or counts) for the main experiments. (A sliding-window sketch of this setup appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU model, CPU model, memory) used for the experiments are mentioned in the paper.
Software Dependencies | No | Since we choose LLaMA as the LLM, which is a decoder-only architecture, under this causal encoding each token can only perceive itself and the tokens before it. The paper names LLaMA as the LLM but does not specify a version number, and no other software dependencies with version numbers are provided. (A toy causal-mask sketch appears after the table.)
Experiment Setup | Yes | The input time series length L is set as 336 for all baselines, and four different prediction horizons T ∈ {96, 192, 336, 720} are used. For short-term forecasting, the PeMS benchmark (PEMS03, PEMS04, PEMS07, PEMS08) is adopted, with all models following the same setup of input length L = 96 and prediction length T = 12. The series is partitioned into non-overlapping patches of length P, resulting in a total of N = ⌊L/P⌋ + 1 input patches X_patch ∈ R^(N×P). These patches are embedded as X_pe ∈ R^(N×d_p) using a simple linear layer: X_pe = Linear(Reshape(X_input)). The model employs h = d_m/d retention heads in each layer, where d is the head dimension. Multi-scale retention (MSR) assigns a different decay γ to each head and adds a swish gate to increase non-linearity. (A code sketch of the patching and multi-scale retention steps appears after the table.)
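The windowing implied by the Dataset Splits row (input length L = 336, horizons T ∈ {96, 192, 336, 720}, and 10% of training timesteps in the few-shot setting) can be sketched as follows. This is a generic illustration, not code from the LeRet repository; the function names and the "keep the first 10% of timesteps" few-shot rule are assumptions.

```python
# Hypothetical sliding-window sample construction for the quoted configuration.
import numpy as np

def make_windows(series: np.ndarray, input_len: int = 336, pred_len: int = 96):
    """Slide over a (time, channels) array and return (inputs, targets) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - input_len - pred_len + 1):
        inputs.append(series[start:start + input_len])
        targets.append(series[start + input_len:start + input_len + pred_len])
    return np.stack(inputs), np.stack(targets)

def few_shot_subset(train_series: np.ndarray, ratio: float = 0.10):
    """Few-shot setting from the row above: keep only 10% of the training timesteps
    (taking the leading portion is an assumption)."""
    return train_series[: int(len(train_series) * ratio)]
```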
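The causal (decoder-only) encoding mentioned under Software Dependencies, where each token perceives only itself and the tokens before it, corresponds to standard causal masking. The snippet below is a minimal, generic illustration and is not taken from the paper or its repository.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention must be blocked, i.e. at strictly future positions.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(4, 4)                              # toy attention scores
scores = scores.masked_fill(causal_mask(4), float("-inf"))
weights = torch.softmax(scores, dim=-1)                 # row n mixes only positions 0..n
```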
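The patching, linear patch embedding, and multi-scale retention described under Experiment Setup can be sketched roughly as below. This is a minimal sketch assuming PatchTST-style end-padding (so that N = ⌊L/P⌋ + 1 non-overlapping patches result) and the RetNet formulation of multi-scale retention with per-head decays γ_h = 1 − 2^(−5−h) and a swish gate; all hyperparameter values (P = 16, d_model = 128, head_dim = 32) are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedding(nn.Module):
    """Split a length-L series into non-overlapping patches of length P, then embed
    each patch with a single linear layer (X_pe = Linear(Reshape(X_input)))."""
    def __init__(self, patch_len: int = 16, d_model: int = 128):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x):                                    # x: (batch, L)
        # Replicate the last value P times so that N = floor(L / P) + 1 patches result.
        x = torch.cat([x, x[:, -1:].expand(-1, self.patch_len)], dim=-1)
        patches = x.unfold(-1, self.patch_len, self.patch_len)   # (batch, N, P)
        return self.proj(patches)                                 # (batch, N, d_model)

class MultiScaleRetention(nn.Module):
    """Parallel retention with a distinct decay gamma per head and a swish gate,
    following the RetNet formulation alluded to in the row above."""
    def __init__(self, d_model: int = 128, head_dim: int = 32):
        super().__init__()
        self.h = d_model // head_dim                     # number of heads h = d_m / d
        self.d = head_dim
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.GroupNorm(self.h, d_model)
        # One decay rate per head (RetNet's default schedule, assumed here).
        gammas = 1.0 - 2.0 ** (-5.0 - torch.arange(self.h, dtype=torch.float))
        self.register_buffer("gammas", gammas)

    def forward(self, x):                                # x: (batch, N, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, N, head_dim).
        q, k, v = [t.view(b, n, self.h, self.d).transpose(1, 2) for t in (q, k, v)]
        # Decay matrix D[h, i, j] = gamma_h ** (i - j) for i >= j, and 0 otherwise.
        idx = torch.arange(n, device=x.device)
        diff = (idx[:, None] - idx[None, :]).float()
        decay = self.gammas[:, None, None] ** diff.clamp(min=0)
        decay = decay * (diff >= 0)
        ret = (q @ k.transpose(-1, -2) * self.d ** -0.5 * decay) @ v   # (b, h, N, d)
        ret = ret.transpose(1, 2).reshape(b, n, -1)
        ret = self.norm(ret.transpose(1, 2)).transpose(1, 2)           # per-head group norm
        # Swish (SiLU) gate over a projection of the input, then output projection.
        return self.out(F.silu(self.gate(x)) * ret)

# Toy usage with the long-term setting quoted above (L = 336), one channel per series.
x = torch.randn(8, 336)
tokens = PatchEmbedding()(x)                 # (8, 22, 128): 336/16 + 1 = 22 patches
y = MultiScaleRetention()(tokens)            # (8, 22, 128)
```

The gate here multiplies a nonlinear projection of the input with the group-normalized retention output before the final projection, which is one common reading of the "swish gate to increase non-linearity" phrasing quoted above.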