General-purpose, long-context autoregressive modeling with Perceiver AR

Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, Jesse Engel

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this architecture produces excellent results on several real-world domains with long-range context: RGB-level images (Section 5.2), tokenized language (Sections 5.3 to 5.5), and audio or symbolic music (Section 5.6). We demonstrate that Perceiver AR can learn to perfectly recognize long-context patterns over distances of at least 100k tokens on a synthetic copy task with known ground-truth structure (Section 5.1.1).
Researcher Affiliation | Industry | 1Google Research, Brain Team, 2DeepMind. Correspondence to: Curtis Hawthorne <fjord@google.com>, Andrew Jaegle <drewjaegle@deepmind.com>.
Pseudocode | No | See Appendix C for an in-depth mathematical description of Perceivers and the Perceiver AR architecture and Appendix E for additional technical details.
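The paper provides no pseudocode in the main text, but the core computation it describes (cross-attention from queries at the final positions of a long input to the full context, followed by a causally masked self-attention stack over the resulting latents) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the function names, the two-layer stack depth, and the absence of projections, multiple heads, and normalization are all simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask):
    # Scaled dot-product attention with a boolean mask
    # (True = attend, False = blocked).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def perceiver_ar_sketch(x, num_latents):
    # x: (seq_len, d) embedded input tokens; seq_len may be very long.
    seq_len = x.shape[0]

    # Queries come only from the final num_latents positions;
    # keys/values span the entire context. This is the step that
    # decouples context length from the width of the self-attention stack.
    q = x[-num_latents:]
    pos_q = np.arange(seq_len - num_latents, seq_len)[:, None]
    pos_kv = np.arange(seq_len)[None, :]
    cross_mask = pos_kv <= pos_q        # causal: no attending to the future
    latents = attention(q, x, x, cross_mask)

    # Causally masked self-attention stack over the latents only
    # (illustrative depth of 2; real models are much deeper).
    causal = np.tril(np.ones((num_latents, num_latents), dtype=bool))
    for _ in range(2):
        latents = latents + attention(latents, latents, latents, causal)

    return latents  # (num_latents, d): one output per target position
```

Because both the cross-attention and self-attention masks are causal, the output at each latent position depends only on input positions at or before it, which is what makes the outputs usable as autoregressive predictions.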
Open Source Code | Yes | Model code is available at https://github.com/google-research/perceiver-ar.
Open Datasets | Yes | To test this architecture's capabilities in the image modality, we use the downsampled ImageNet dataset (van den Oord et al., 2016b) at the 64×64 resolution.
Dataset Splits | Yes | After 750k steps, we achieve 3.40 bits/dim on the validation set, exceeding the performance of previous autoregressive models (Table 3).
Hardware Specification | Yes | Training and evaluation were done on either TPUv2 or TPUv3 clusters.
Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2015) as implemented in the Optax framework (Hessel et al., 2020) with b1 = 0.1, b2 = 0.999, eps = 1e-8, a base learning rate of 3e-4, and a 10k step linear warmup.
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) as implemented in the Optax framework (Hessel et al., 2020) with b1 = 0.1, b2 = 0.999, eps = 1e-8, a base learning rate of 3e-4, and a 10k step linear warmup.
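The optimizer setup quoted above (Adam with a linear warmup) is concrete enough to sketch. Below is a dependency-free illustration of the described schedule and a single Adam update step; it is a sketch under assumptions, not the paper's code. In particular, the default b1 = 0.9 shown here is the standard Adam value, since the "b1 = 0.1" in the extracted quote may be a transcription artifact; post-warmup decay behavior is also not specified in the quote, so the schedule simply holds the base rate constant.

```python
import numpy as np

def lr_schedule(step, base_lr=3e-4, warmup_steps=10_000):
    """Linear warmup to base_lr over warmup_steps, then constant.

    (Decay after warmup, if any, is not specified in the quoted setup.)
    """
    return base_lr * min(1.0, step / warmup_steps)

def adam_step(param, grad, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015); t is the 1-indexed step."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2        # second-moment estimate
    m_hat = m / (1 - b1**t)                # bias correction
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

A training loop would call `lr_schedule(step)` each step and feed the result into `adam_step`; Optax users would instead compose `optax.linear_schedule` with `optax.adam`.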