Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Delving into Large Language Models for Effective Time-Series Anomaly Detection

Authors: Junwoo Park, Kyudan Jung, Dohyun Lee, Hyuck Lee, Daehoon Gwak, ChaeHun Park, Jaegul Choo, Jaewoong Cho

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method outperforms 21 existing prompting strategies on the AnomLLM benchmark, achieving up to a 66.6% improvement in F1 score. We further compare LLMs with 16 non-LLM baselines on the TSB-AD benchmark, highlighting scenarios where LLMs offer unique advantages via contextual reasoning. Our findings provide empirical insights into how and when LLMs can be effective for TSAD.
Researcher Affiliation | Collaboration | Junwoo Park1,2, Kyudan Jung1,2, Dohyun Lee1,2, Hyuck Lee2, Daehoon Gwak1, Chae Hun Park1, Jaegul Choo1, Jaewoong Cho2; 1: KAIST AI, 2: KRAFTON
Pseudocode | Yes | Algorithm 1 Component Detection in Time Series
Open Source Code | Yes | The code is publicly available at: https://github.com/junwoopark92/LLM-TSAD.
Open Datasets | Yes | We leverage the well-curated datasets introduced by AnomLLM [89], which include four representative anomaly types (point, range, trend, and frequency) with accurately annotated intervals. These datasets are designed to facilitate fine-grained evaluation under various anomaly conditions. To complement these controlled settings, we also incorporate the TSB-AD benchmark [46], which reflects unsupervised anomaly detection in more realistic time-series applications.
Dataset Splits | No | The dataset consists of 1,600 time series samples, with 400 instances per anomaly type. We convert the original interval-based labels into binary labels by marking whether any anomaly interval is present in a given time series, effectively framing the task as instance-level TSAD without requiring localization. ... To reduce the overall experimental cost, we limit the evaluation set by selecting time series from eight categories (highlighted in green in Table 8) within the TSB-AD-U benchmark, focusing on those with relatively shorter lengths.
Hardware Specification | Yes | The open-source models are hosted on an A100 4-GPU machine using the lmdeploy library, and queries are issued locally through this setup.
Software Dependencies | No | For the open-source models, we employ InternVL2-Llama3-76B and Qwen2.5-VL-72B-Instruct. For the API-based models, we use Gemini-1.5-Flash by Google and GPT-4o by OpenAI. The open-source models are hosted on an A100 4-GPU machine using the lmdeploy library... Specifically, we employed the seasonal_decompose function from the statsmodels package
Experiment Setup | Yes | The experimental setup follows the same configuration as the TSAD task in AnomLLM, with the only modification being the binary output format. We adopt the F1-Macro score to fairly assess performance across both normal and anomalous classes... Specifically, we used the following hyperparameters: thresh_trend=0.57, thresh_seasonal=0.1, thresh_resid=0.15
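
The Dataset Splits and Experiment Setup entries describe two steps that are easy to misread in prose: collapsing interval annotations into one binary label per series, and scoring predictions with macro-averaged F1. The following is a minimal sketch of those two steps, not the authors' implementation. The names `to_instance_label`, `f1_for_class`, and `f1_macro`, the `(start, end)` interval representation, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: each anomaly interval is a (start, end) index pair.
Interval = Tuple[int, int]


def to_instance_label(intervals: Sequence[Interval]) -> int:
    """Return 1 if the series has at least one annotated anomaly interval, else 0."""
    return int(len(intervals) > 0)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the class never appears at all.
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the F1 scores of the normal (0) and anomalous (1) classes."""
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2


# Toy annotations (made up): two anomalous series, two normal ones.
annotations: List[List[Interval]] = [[(10, 20)], [], [(3, 5), (40, 42)], []]
y_true = [to_instance_label(a) for a in annotations]

# Toy model outputs (made up): the second anomalous series is missed.
y_pred = [1, 0, 0, 0]

score = f1_macro(y_true, y_pred)
print(f"F1-Macro: {score:.4f}")
```

Because the macro average weights both classes equally, a model that labels every series as normal cannot score well even when normal series dominate, which is consistent with the stated aim of assessing both classes fairly.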