Continual Transformers: Redundancy-Free Attention for Online Inference

Authors: Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis

ICLR 2023

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. LLM Response: We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: Our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63× and 2.6×, respectively, while retaining predictive performance. [...] In our experiments, we validate their exceptional efficiency improvements on common benchmarks in Online Action Detection (Idrees et al., 2017) and Online Audio Classification (Tzanetakis et al., 2001).
Researcher Affiliation: Academia. LLM Response: Lukas Hedegaard, Arian Bakhtiarnia & Alexandros Iosifidis, Department of Electrical and Computer Engineering, Aarhus University, Aarhus, Denmark. {lhm,arianbakh,ai}@ece.au.dk
Pseudocode: No. LLM Response: The paper describes mathematical formulations for attention mechanisms and provides visual diagrams, but it does not include structured pseudocode or algorithm blocks.
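Although the paper itself provides no pseudocode, the kind of attention formulation referred to above can be illustrated with a short sketch. The following is a hedged reading of a single-output attention step, where only the newest query attends over a cache of previous keys and values; it is not code from the paper, and the dimensions, cache length, and function name are illustrative assumptions.

```python
# Hedged sketch: single-output attention, where only the newest query attends
# over a rolling cache of keys and values, so earlier attention rows need not
# be recomputed. Dimensions (d, n) and the cache contents are illustrative.
import torch

d, n = 192, 64                        # assumed embedding dim and cache (sequence) length
K = torch.randn(n, d)                 # cached keys from the last n steps
V = torch.randn(n, d)                 # cached values from the last n steps

def single_output_attention(q, K, V):
    """Attention output for the newest token only: softmax(q K^T / sqrt(d)) V."""
    scores = (q @ K.T) / (d ** 0.5)   # shape (n,)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                # shape (d,)

q_new = torch.randn(d)                # query of the most recent input token
out = single_output_attention(q_new, K, V)
```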
Open Source Code: Yes. LLM Response: Source code: https://github.com/lukashedegaard/continual-transformers.
Open Datasets: Yes. LLM Response: We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: [...] The THUMOS14 dataset (Idrees et al., 2017) for OAD has 200 and 213 validation and test videos, respectively [...] For TVSeries (De Geest et al., 2016), the network learns on the train and validation sets (20 videos) and evaluates on the test set (7 videos) [...] We conduct experiments on the Music Genre Classification dataset GTZAN (Tzanetakis & Cook, 2002).
Dataset Splits: Yes. LLM Response: The THUMOS14 dataset (Idrees et al., 2017) for OAD has 200 and 213 validation and test videos, respectively, with frame-level class annotations across 20 classes. As in prior OAD works, the model is trained on the validation set and evaluated on the test set. [...] For TVSeries (De Geest et al., 2016), the network learns on the train and validation sets (20 videos) and evaluates on the test set (7 videos) as in (Wang et al., 2021). [...] Since there are no predefined splits for GTZAN, we randomly select 10% of the data for validation and 10% for testing.
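Since GTZAN has no predefined splits, the quoted 80/10/10 random split could be reproduced along the following lines. This is a minimal sketch, not the authors' script; the use of scikit-learn, stratification by genre, and the fixed seed are assumptions.

```python
# Minimal sketch of an 80/10/10 random split for GTZAN, assuming scikit-learn.
# Stratification by genre and the fixed seed are assumptions added for
# reproducibility, not details stated in the paper.
from sklearn.model_selection import train_test_split

def split_gtzan(clips, genres, seed=0):
    # Hold out 10% of all clips for testing.
    rest_clips, test_clips, rest_genres, test_genres = train_test_split(
        clips, genres, test_size=0.10, stratify=genres, random_state=seed)
    # Hold out another 10% of the original data (1/9 of the remainder) for validation.
    train_clips, val_clips, train_genres, val_genres = train_test_split(
        rest_clips, rest_genres, test_size=1 / 9, stratify=rest_genres, random_state=seed)
    return (train_clips, train_genres), (val_clips, val_genres), (test_clips, test_genres)
```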
Hardware Specification: Yes. LLM Response: We report results using two epochs of training on a Nvidia RTX 2080 Ti GPU. [...] All audio classification training procedures were carried out on a single Nvidia RTX 2080 Ti GPU.
Software Dependencies: No. LLM Response: The paper mentions using the FAIRSEQ library (Ott et al., 2019) and the MMAction2 pipeline (Contributors, 2020), but does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup: Yes. LLM Response: Following Wang et al. (2021), we use a batch size of 128, sequence length 64, initial learning rate 10^-4 with a factor ten reduction each epoch, alongside weight decay 10^-4, and dropout with probability 0.1. [...] A batch size of 64 and the Adam optimizer (Kingma & Ba, 2015) are used with an initial learning rate of 10^-4. The learning rate is reduced by a factor of 0.6 on plateau with a tolerance of two epochs, and an early stopping mechanism is used, with a maximum of 100 epochs allowed. [...] The Transformer Encoder is trained on the whole temporal sequence using a batch size of 32 and the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 10^-5 and a weight decay of 10^-4 for 50 epochs.
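The quoted audio-classification setup (Adam, initial learning rate 10^-4, plateau-based reduction by a factor of 0.6 with a tolerance of two epochs, early stopping, and at most 100 epochs) maps onto standard PyTorch components. The sketch below is only an illustration under those quoted values; the stand-in model, dummy batch, and early-stopping patience of 10 epochs are assumptions.

```python
# Hedged sketch of the quoted audio-classification optimization setup in PyTorch.
# The tiny stand-in model, dummy batch, and early-stopping patience are
# illustrative assumptions; the optimizer and scheduler values follow the quotes.
import torch

model = torch.nn.Linear(192, 10)     # stand-in for the Continual Transformer Encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.6, patience=2)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 192)             # dummy batch of 64 feature vectors
y = torch.randint(0, 10, (64,))      # dummy genre labels

best_loss, stale_epochs = float("inf"), 0
for epoch in range(100):             # at most 100 epochs
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())      # reduce the learning rate by 0.6 on plateau
    if loss.item() < best_loss:
        best_loss, stale_epochs = loss.item(), 0
    else:
        stale_epochs += 1
        if stale_epochs >= 10:       # assumed early-stopping patience
            break
```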