General-purpose, long-context autoregressive modeling with Perceiver AR
Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that this architecture produces excellent results on several real-world domains with long-range context: RGB-level images (Section 5.2), tokenized language (Sections 5.3 to 5.5), and audio or symbolic music (Section 5.6). We demonstrate that Perceiver AR can learn to perfectly recognize long-context patterns over distances of at least 100k tokens on a synthetic copy task with known ground-truth structure (Section 5.1.1). |
| Researcher Affiliation | Industry | ¹Google Research, Brain Team, ²DeepMind. Correspondence to: Curtis Hawthorne <fjord@google.com>, Andrew Jaegle <drewjaegle@deepmind.com>. |
| Pseudocode | No | See Appendix C for an in-depth mathematical description of Perceivers and the Perceiver AR architecture and Appendix E for additional technical details. |
| Open Source Code | Yes | Model code is available at https://github.com/google-research/perceiver-ar. |
| Open Datasets | Yes | To test this architecture's capabilities in the image modality, we use the downsampled ImageNet dataset (van den Oord et al., 2016b) at the 64×64 resolution. |
| Dataset Splits | Yes | After 750k steps, we achieve 3.40 bits/dim on the validation set, exceeding the performance of previous autoregressive models (Table 3). |
| Hardware Specification | Yes | Training and evaluation were done on either TPUv2 or TPUv3 clusters. |
| Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2015) as implemented in the Optax framework (Hessel et al., 2020) with b1 = 0.1, b2 = 0.999, eps = 1e-8, a base learning rate of 3e-4, and a 10k step linear warmup. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) as implemented in the Optax framework (Hessel et al., 2020) with b1 = 0.1, b2 = 0.999, eps = 1e-8, a base learning rate of 3e-4, and a 10k step linear warmup. (A hedged Optax sketch of this configuration follows the table.) |
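
The optimizer configuration quoted in the Software Dependencies and Experiment Setup rows can be written directly in Optax. The sketch below is a minimal, hedged reconstruction under stated assumptions: the warmup length, base learning rate, b1, b2, and eps values come from the excerpt; holding the learning rate constant after warmup and all other settings (e.g., no gradient clipping) are assumptions, since the excerpt does not specify them.

```python
# Hedged sketch of the quoted optimizer setup: Adam via Optax with a
# 10k-step linear warmup to a 3e-4 base learning rate.
import optax

WARMUP_STEPS = 10_000
BASE_LR = 3e-4

# Linear warmup from 0 to the base rate; holding it constant afterwards is an
# assumption, as the excerpt only mentions the warmup.
schedule = optax.linear_schedule(
    init_value=0.0,
    end_value=BASE_LR,
    transition_steps=WARMUP_STEPS,
)

# Adam with the hyperparameters as quoted in the excerpt above.
optimizer = optax.adam(
    learning_rate=schedule,
    b1=0.1,
    b2=0.999,
    eps=1e-8,
)
```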