VQ-TR: Vector Quantized Attention for Time Series Forecasting

Authors: Kashif Rasul, Andrew Bennett, Pablo Vicente, Umang Gupta, Hena Ghonia, Anderson Schneider, Yuriy Nevmyvaka

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this comparison, we find that VQ-TR performs better or comparably to all other methods while being computationally efficient.
Researcher Affiliation | Collaboration | Kashif Rasul, Andrew Bennett, Pablo Vicente, Anderson Schneider & Yuriy Nevmyvaka (Morgan Stanley, New York, USA; kashif.rasul@gmail.com); Umang Gupta (USC, Los Angeles, USA); Hena Ghonia (Université de Montréal, Montréal, Canada).
Pseudocode | Yes | Section D.6, titled "VQ-TR IMPLEMENTATION DETAILS", provides Python code for the main components of the VQ-TR model, including the `FeedForward`, `Attention`, `VQAttention`, and `VQTrModel` classes (a hedged sketch of the vector-quantized attention block is given after this table).
Open Source Code | No | The full code will be published on acceptance, and hyperparameter details are provided in Section D.3; complete details for running these experiments will be available with the code release.
Open Datasets | Yes | We use the following open datasets: Exchange (Lai et al., 2018), Solar (Lai et al., 2018), Electricity, Traffic, Taxi, and Wikipedia, preprocessed exactly as in Salinas et al. (2019a). Footnotes 3–6 in the paper provide the URLs for Electricity, Traffic, Taxi, and Wikipedia.
Dataset Splits | No | The paper discusses training data (Dtrain) and test data (Dtest) and mentions using context/prediction windows, but it does not specify a distinct validation split (e.g., percentages or methodology) for hyperparameter tuning or early stopping.
Hardware Specification | Yes | The experiments were performed on a single Tesla V100S GPU with 32GB of RAM.
Software Dependencies | No | Section D.6 provides code snippets using PyTorch modules (e.g., `torch`, `torch.nn`, `torch.nn.functional`) and `vector_quantize_pytorch`, but specific version numbers for these libraries or for Python itself are not mentioned.
Experiment Setup | Yes | We use two encoder layers and six decoder layers, i.e., N = 2 and M = 6. We use J = 25 codebook vectors and train with a batch size of 256 for 20 epochs using the Adam (Kingma and Ba, 2015) optimizer with default parameters and a learning rate of 0.001. At inference time, we sample S = 100 times for each time point and feed these samples in parallel via the batch dimension autoregressively through the decoder to produce the reported metrics. (A training-skeleton sketch based on these values also follows the table.)
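
As a point of reference for the pseudocode and software-dependency rows, below is a minimal sketch of how a vector-quantized attention block could be assembled from `torch` and `vector_quantize_pytorch`. The class name `VQAttentionSketch`, the projection layout, and the choice to quantize only the keys are illustrative assumptions; the authors' actual `VQAttention` code is given in their Section D.6 and has not been released separately.

```python
# Minimal sketch of a vector-quantized attention block (assumption: keys are
# quantized onto a small learned codebook before standard attention).
# Not the authors' code; dimensions and layout are illustrative only.
import torch
import torch.nn as nn
from vector_quantize_pytorch import VectorQuantize


class VQAttentionSketch(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4, codebook_size: int = 25):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        # codebook_size=25 mirrors the J = 25 codebook vectors quoted above.
        self.vq = VectorQuantize(dim=dim, codebook_size=codebook_size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, sequence_length, dim)
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        # Quantize the keys; commit_loss would be added to the training loss.
        k_quantized, _indices, commit_loss = self.vq(k)
        attended, _ = self.attn(q, k_quantized, v)
        return self.out(attended), commit_loss


if __name__ == "__main__":
    block = VQAttentionSketch()
    y, aux_loss = block(torch.randn(2, 32, 64))
    print(y.shape, aux_loss.item())
```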
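
The experiment-setup row could translate into roughly the following training skeleton; only the optimizer, learning rate, batch size, and epoch count are quoted from the paper, while the model, dataset, and loss below are placeholders, not the VQ-TR model or its datasets.

```python
# Hypothetical training-loop skeleton wiring up only the hyperparameters quoted
# above (Adam with default parameters, lr=0.001, batch size 256, 20 epochs).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(16, 1)  # stand-in for the actual VQTrModel
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=256, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for features, target in loader:
        optimizer.zero_grad()
        loss = F.mse_loss(model(features), target)  # placeholder objective
        loss.backward()
        optimizer.step()
```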