Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AliO: Output Alignment Matters in Long-Term Time Series Forecasting

Authors: Kwangryeol Park, Jaeho Kim, Seulki Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that AliO effectively improves the output alignment, i.e., up to 58.2% in TAM, while maintaining or enhancing the forecasting performance (up to 27.5%). This improved output alignment increases the reliability of the LTSF models, making them more applicable in real-world scenarios. The code implementation is on the GitHub repository³.
Researcher Affiliation	Academia	Kwangryeol Park Artificial Intelligence Graduate School Ulsan National Institute of Science & Technology (UNIST), South Korea EMAIL Jaeho Kim Artificial Intelligence Korea University, South Korea EMAIL Seulki Lee Department of Computer Science and Engineering Ulsan National Institute of Science & Technology (UNIST), South Korea EMAIL
Pseudocode	Yes	Algorithm 1 The procedure of AliO. denotes element-wise multiplication. FFT( , , ) returns the frequency domain representation of each signal sequences, sg( ) is stop-gradient operator.
Open Source Code	Yes	The code implementation is on the GitHub repository³. ³ https://github.com/eai-lab/AliO
Open Datasets	Yes	We experiment with AliO on representative LTSF tasks, including ETT{h1, h2, m1, m2}, Electricity (ECL), Traffic, Weather, and ILI dataset [37], using various state-of-the-art LTSF models such as CycleNet [27], GPT4TS [43], iTransformer [30], PatchTST [33], TimesNet [36], DLinear [40] and Autoformer [37]. ... Electricity (ECL) [13]: Comprises electric power consumption data sampled every minute over four years for a single household. UCI Machine Learning Repository, 2006. DOI: https://doi.org/10.24432/C58K54. ILI (https://github.com/thuml/Autoformer): Weekly records from 2002 to 2021, provided by the US Centers for Disease Control and Prevention, representing the number of influenza-like illness patients.
Dataset Splits	Yes	Table 3: Descriptions of datasets. # of vars means the number of variate in each dataset. Dataset: ETTh1; # of vars: 7; Prediction Length: 24, 48, 168, 336, 720 (Autoformer), 96, 192, 336, 720 (other models); Dataset size (Train / Validation / Test): 8545 / 2881 / 2881; Frequency: Hourly; Domain: Temperature
Hardware Specification	Yes	All experiments were conducted on NVIDIA RTX 3090 and A6000 GPUs.
Software Dependencies	No	The optimizer we used is Adam [23]. To ensure fair model evaluation, we utilized the official GitHub codes provided by six benchmark models: Autoformer [37], DLinear [40], PatchTST [33], TimesNet [36], iTransformer [30], GPT4TS [43], CycleNet [27]. For all experiments, we adopted the same prediction length, label length, and sequence length as the official implementations, and maintained the original architecture of each model.
Experiment Setup	Yes	For AliO, we set N = 2 and l = 1, covering a wide prediction range. The search space for λT and λF are {1.0, 2.0, 5.0} and {0.0, 0.5, 1.0, 2.0}, respectively, and we report the best results in the main results. ... We used Mean Squared Error (MSE) as our baseline loss function and reported forecasting performance using MSE, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), and TAM. For all these metrics, lower values indicate better performance. ... We used three random seeds (2023, 2024, 2025) for initialization and report the standard deviation using . ... We used the same learning rate, batch size, and epoch as the official implementation of each model for reproducibility and fair comparison. The implementation code for each model is as follows. ... Tables 4 to 11 list the learning rates, batch sizes, and epochs used for various models and datasets. The optimizer we used is Adam [23].
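
The Experiment Setup row quotes a hyperparameter search space, three random seeds, and a set of standard error metrics. As a rough illustration only, the sketch below enumerates that grid and computes textbook MSE, MAE, RMSE, and MAPE in plain Python. It is not taken from the AliO repository: the names `search_grid` and `point_metrics`, the `eps` guard in MAPE, and the assumption that every (λT, λF) pair is run with every seed are all hypothetical. TAM is omitted because its definition is not given on this page.

```python
import itertools
import math

# Values quoted in the Experiment Setup row.
LAMBDA_T = [1.0, 2.0, 5.0]
LAMBDA_F = [0.0, 0.5, 1.0, 2.0]
SEEDS = [2023, 2024, 2025]


def search_grid():
    """Enumerate (lambda_T, lambda_F, seed) triples.

    Assumption: each hyperparameter pair is run once per seed; the page
    does not state how the seeds and the search space are combined.
    """
    return list(itertools.product(LAMBDA_T, LAMBDA_F, SEEDS))


def point_metrics(y_true, y_pred, eps=1e-8):
    """Textbook MSE, MAE, RMSE and MAPE for two equal-length sequences.

    eps is a hypothetical guard against division by zero in MAPE; the
    paper's exact MAPE convention may differ.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / max(abs(t), eps) for e, t in zip(errors, y_true)) / n
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse), "MAPE": mape}
```

Under these assumptions the grid contains 3 × 4 × 3 = 36 runs per model and dataset, and lower values are better for every metric returned by `point_metrics`.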