Kernel Multimodal Continuous Attention
Authors: Alexander Moreno, Zhenke Wu, Supriya Nagesh, Walter Dempsey, James M. Rehg
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that kernel continuous attention often outperforms unimodal continuous attention, and the sparse variant tends to highlight time series peaks. |
| Researcher Affiliation | Collaboration | Alexander Moreno (Luminous Computing); Zhenke Wu (University of Michigan); Supriya Nagesh (Georgia Tech); Walter Dempsey (University of Michigan); James M. Rehg (Georgia Tech) |
| Pseudocode | Yes | Algorithm 1 Continuous Attention Mechanism via Kernel Deformed Exponential Families |
| Open Source Code | Yes | Code is in our repository, where we discuss the flags used to control precision on recent GPUs and PyTorch versions: https://github.com/onenoc/kernel-continuous-attention |
| Open Datasets | Yes | We analyze uWave [19]: accelerometer time series with eight gesture classes. We follow [16]'s split into 3,582 training observations and 896 test observations; sequences have length 945. ... We extend [23]'s code for IMDB sentiment classification [20]. This uses a document representation v from a convolutional network and an LSTM attention model. ... [23]'s repository for this dataset is https://github.com/deep-spin/quati |
| Dataset Splits | No | For uWave, the paper states 'We follow [16]'s split into 3,582 training observations and 896 test observations'. Training and test splits are specified, but no explicit validation split is reported with counts or percentages. |
| Hardware Specification | Yes | Our UWave and ECG experiments were done on a Titan X GPU, IMDB on a 1080, and Ford A on an A40. |
| Software Dependencies | Yes | As an example, Figure 8 was done on a Titan X with an older version of PyTorch, while Figure 7 was done on an A40 with PyTorch 1.12. |
| Experiment Setup | Yes | For uWave: 'All methods use 100 attention heads. Gaussian mixture uses 100 components (and thus 300 parameters per head), and kernel methods use 256 inducing points.' For IMDB: 'N: basis functions, I = 10 inducing points, bandwidth 0.01.' A hedged sketch of this kernel-attention parameterization follows the table. |
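The rows above reference Algorithm 1 and the kernel-attention hyperparameters (inducing points, bandwidth). As a rough illustration only, the following is a minimal PyTorch sketch of a kernel exponential-family attention density over [0, 1], normalized by numerical integration. The function name, the RBF kernel choice, and the trapezoidal normalization are assumptions made for this sketch; it does not reproduce the authors' Algorithm 1 or the sparse (deformed) variant.

```python
import torch

def kernel_attention_density(alpha, inducing, bandwidth, t_grid):
    """Evaluate an illustrative kernel exponential-family attention density on a grid.

    alpha:     (I,) coefficients attached to the I inducing points (hypothetical)
    inducing:  (I,) inducing-point locations in [0, 1]
    bandwidth: RBF kernel bandwidth
    t_grid:    (T,) evaluation grid over [0, 1]
    """
    # RBF kernel between every grid point and every inducing point: shape (T, I)
    sq_dist = (t_grid[:, None] - inducing[None, :]) ** 2
    K = torch.exp(-sq_dist / (2.0 * bandwidth ** 2))
    # Kernel exponential family: log p(t) is proportional to sum_i alpha_i k(t, z_i)
    log_unnorm = K @ alpha
    unnorm = torch.exp(log_unnorm - log_unnorm.max())  # subtract max for stability
    # Normalize numerically (trapezoidal rule) so the density integrates to one
    Z = torch.trapezoid(unnorm, t_grid)
    return unnorm / Z

# Toy usage mirroring the reported IMDB settings (I = 10 inducing points, bandwidth 0.01)
inducing = torch.linspace(0.0, 1.0, 10)
alpha = torch.randn(10)
t_grid = torch.linspace(0.0, 1.0, 1000)
density = kernel_attention_density(alpha, inducing, bandwidth=0.01, t_grid=t_grid)
print(torch.trapezoid(density, t_grid))  # ~1.0
```

In continuous attention, a density of this kind is typically integrated against a value function over the time axis, replacing the discrete softmax over positions; the multimodality here comes from the kernel expansion rather than a single Gaussian.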