Kernel Multimodal Continuous Attention

Authors: Alexander Moreno, Zhenke Wu, Supriya Nagesh, Walter Dempsey, James M. Rehg

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that kernel continuous attention often outperforms unimodal continuous attention, and the sparse variant tends to highlight time series peaks.
Researcher Affiliation | Collaboration | Alexander Moreno (Luminous Computing), Zhenke Wu (University of Michigan), Supriya Nagesh (Georgia Tech), Walter Dempsey (University of Michigan), James M. Rehg (Georgia Tech)
Pseudocode | Yes | Algorithm 1: Continuous Attention Mechanism via Kernel Deformed Exponential Families
Open Source Code | Yes | Code is in our repository (https://github.com/onenoc/kernel-continuous-attention), where we discuss the flags used to control precision on recent GPUs and PyTorch versions.
Open Datasets | Yes | We analyze uWave [19]: accelerometer time series with eight gesture classes. We follow [16]'s split into 3,582 training observations and 896 test observations; sequences have length 945. ... We extend [23]'s code for IMDB sentiment classification [20]. This uses a document representation v from a convolutional network and an LSTM attention model. ([23]'s repository for this dataset is https://github.com/deep-spin/quati.)
Dataset Splits | No | For uWave, the paper states "We follow [16]'s split into 3,582 training observations and 896 test observations." Training and test splits are specified, but no explicit validation split is given with counts or percentages.
Hardware Specification | Yes | Our UWave and ECG experiments were done on a Titan X GPU, IMDB on a 1080, and Ford A on an A40.
Software Dependencies | Yes | As an example, Figure 8 was done on a Titan X with an older version of PyTorch, while Figure 7 was done with an A40 with PyTorch 1.12.
Experiment Setup | Yes | For uWave: "All methods use 100 attention heads. Gaussian mixture uses 100 components (and thus 300 parameters per head), and kernel methods use 256 inducing points." For IMDB: "N: basis functions, I = 10 inducing points, bandwidth 0.01."
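The Pseudocode and Experiment Setup rows above refer to attention densities built from kernel expansions over inducing points. As a rough illustration of that idea only (not a reproduction of the paper's Algorithm 1 or of the repository code), the PyTorch sketch below parameterizes an unnormalized log-density with a Gaussian kernel over inducing points, normalizes it numerically on a time grid, and returns the expected value of a sampled value function; the function and variable names (`kernel_attention_context`, `gaussian_kernel`) are illustrative assumptions, and the single-head, grid-based normalization is a simplification.

```python
# Minimal sketch of kernel exponential-family continuous attention for one head,
# approximated on a discrete time grid. Assumptions: Gaussian kernel, numerical
# normalization via softmax over the grid; not the paper's actual implementation.
import torch


def gaussian_kernel(grid, inducing, bandwidth):
    # grid: (G,) time points, inducing: (I,) inducing points -> (G, I) kernel matrix
    diff = grid.unsqueeze(1) - inducing.unsqueeze(0)
    return torch.exp(-0.5 * (diff / bandwidth) ** 2)


def kernel_attention_context(values, coeffs, inducing, bandwidth=0.01):
    # values: (G, D) value function sampled on the grid
    # coeffs: (I,) learned kernel coefficients for a single attention head
    grid = torch.linspace(0.0, 1.0, values.shape[0])   # normalized time grid
    K = gaussian_kernel(grid, inducing, bandwidth)     # (G, I)
    log_p = K @ coeffs                                 # unnormalized log-density p(t) on the grid
    p = torch.softmax(log_p, dim=0)                    # numerical normalization over the grid
    return p @ values                                  # approximate E_p[V(t)], shape (D,)


# Usage with the IMDB-style settings quoted above: I = 10 inducing points, bandwidth 0.01.
inducing = torch.linspace(0.0, 1.0, 10)
coeffs = torch.randn(10)
values = torch.randn(1000, 64)   # hypothetical 64-dim value function on a 1000-point grid
context = kernel_attention_context(values, coeffs, inducing, bandwidth=0.01)
```

The sparse variant mentioned in the Research Type row would, as I understand it, replace the exponential normalization with a deformed (2-)exponential that can assign exactly zero density outside a compact region, which is what lets it highlight time series peaks.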