General-purpose, long-context autoregressive modeling with Perceiver AR

Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, Jesse Engel

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this architecture produces excellent results on several real-world domains with long-range context: RGB-level images (Section 5.2), tokenized language (Sections 5.3 to 5.5), and audio or symbolic music (Section 5.6). We demonstrate that Perceiver AR can learn to perfectly recognize long-context patterns over distances of at least 100k tokens on a synthetic copy task with known ground-truth structure (Section 5.1.1).
Researcher Affiliation | Industry | 1Google Research, Brain Team, 2DeepMind. Correspondence to: Curtis Hawthorne <fjord@google.com>, Andrew Jaegle <drewjaegle@deepmind.com>.
Pseudocode | No | See Appendix C for an in-depth mathematical description of Perceivers and the Perceiver AR architecture and Appendix E for additional technical details.
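The paper provides no pseudocode in the main text, but the core computation it describes (cross-attention from queries at the final positions of a long input to the full context, followed by a causally masked self-attention stack over the resulting latents) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the function names, the two-layer stack depth, and the absence of projections, multiple heads, and normalization are all simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask):
    # Scaled dot-product attention with a boolean mask
    # (True = attend, False = blocked).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def perceiver_ar_sketch(x, num_latents):
    # x: (seq_len, d) embedded input tokens; seq_len may be very long.
    seq_len = x.shape[0]

    # Queries come only from the final num_latents positions;
    # keys/values span the entire context. This is the step that
    # decouples context length from the width of the self-attention stack.
    q = x[-num_latents:]
    pos_q = np.arange(seq_len - num_latents, seq_len)[:, None]
    pos_kv = np.arange(seq_len)[None, :]
    cross_mask = pos_kv <= pos_q        # causal: no attending to the future
    latents = attention(q, x, x, cross_mask)

    # Causally masked self-attention stack over the latents only
    # (illustrative depth of 2; real models are much deeper).
    causal = np.tril(np.ones((num_latents, num_latents), dtype=bool))
    for _ in range(2):
        latents = latents + attention(latents, latents, latents, causal)

    return latents  # (num_latents, d): one output per target position
```

Because both the cross-attention and self-attention masks are causal, the output at each latent position depends only on input positions at or before it, which is what makes the outputs usable as autoregressive predictions.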
Open Source Code | Yes | Model code is available at https://github.com/google-research/perceiver-ar.
Open Datasets | Yes | To test this architecture's capabilities in the image modality, we use the downsampled ImageNet dataset (van den Oord et al., 2016b) at the 64×64 resolution.
Dataset Splits | Yes | After 750k steps, we achieve 3.40 bits/dim on the validation set, exceeding the performance of previous autoregressive models (Table 3).
Hardware Specification | Yes | Training and evaluation were done on either TPUv2 or TPUv3 clusters.
Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2015) as implemented in the Optax framework (Hessel et al., 2020) with b1 = 0.1, b2 = 0.999, eps = 1e-8, a base learning rate of 3e-4, and a 10k step linear warmup.
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) as implemented in the Optax framework (Hessel et al., 2020) with b1 = 0.1, b2 = 0.999, eps = 1e-8, a base learning rate of 3e-4, and a 10k step linear warmup.
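The optimizer setup quoted above (Adam with a linear warmup) is concrete enough to sketch. Below is a dependency-free illustration of the described schedule and a single Adam update step; it is a sketch under assumptions, not the paper's code. In particular, the default b1 = 0.9 shown here is the standard Adam value, since the "b1 = 0.1" in the extracted quote may be a transcription artifact; post-warmup decay behavior is also not specified in the quote, so the schedule simply holds the base rate constant.

```python
import numpy as np

def lr_schedule(step, base_lr=3e-4, warmup_steps=10_000):
    """Linear warmup to base_lr over warmup_steps, then constant.

    (Decay after warmup, if any, is not specified in the quoted setup.)
    """
    return base_lr * min(1.0, step / warmup_steps)

def adam_step(param, grad, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015); t is the 1-indexed step."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2        # second-moment estimate
    m_hat = m / (1 - b1**t)                # bias correction
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

A training loop would call `lr_schedule(step)` each step and feed the result into `adam_step`; Optax users would instead compose `optax.linear_schedule` with `optax.adam`.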