Perceiver IO: A General Architecture for Structured Inputs & Outputs

Authors: Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J Henaff, Matthew Botvinick, Andrew Zisserman, Oriol Vinyals, Joao Carreira

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. ... To probe the generality of Perceiver IO, we evaluate it on several domains including language understanding (Wikipedia+C4 masked language modeling), visual understanding (Sintel/KITTI optical flow and ImageNet classification), multi-modal (Kinetics autoencoding and AudioSet classification) & multi-task settings (multi-task GLUE), and symbolic representations for games (StarCraft II).
Researcher Affiliation | Industry | Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira ... All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020).
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found.
Open Source Code | No | The paper does not provide an explicit statement or link for open-sourcing the Perceiver IO code. It mentions JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020), but these are general tools the authors used, not an implementation of their specific method.
Open Datasets | Yes | language understanding (Wikipedia+C4 masked language modeling), visual understanding (Sintel/KITTI optical flow and ImageNet classification), multi-modal (Kinetics autoencoding and AudioSet classification) & multi-task settings (multi-task GLUE), and symbolic representations for games (StarCraft II). ... We pretrain on the Masked Language Modeling (MLM) task proposed in Devlin et al. (2019) using a large text corpus obtained by combining English Wikipedia and C4 (Raffel et al., 2020).
Dataset Splits | Yes | We finetune Perceiver IO on the GLUE Benchmark (Wang et al., 2019), reporting the best performance on the dev set for a fixed-size sweep of finetuning hyperparameters.
Hardware Specification | Yes | All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020). ... We use a batch size of 1024 and 64 TPUs. ... Our most expensive model achieves approximately 0.8 frames/sec on a 2017 TITAN Xp, and our lightweight model (with conv downsampling and RAFT-style upsampling) achieves 3.3 frames/sec... On the publicly-available TPU v3, however, our most expensive model achieves 4.4 frames/sec on a single TPU core, and 17.8 frames/sec for the lightweight model.
Software Dependencies | No | All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020). ... An efficient Tensorflow implementation of RAFT (Sun et al., 2020) (received courtesy of the authors) achieves only 1.6 frames/sec on the same hardware. The paper mentions software like JAX and TensorFlow but does not provide specific version numbers.
Experiment Setup | Yes | We finetune Perceiver IO on the GLUE Benchmark (Wang et al., 2019), reporting the best performance on the dev set for a fixed-size sweep of finetuning hyperparameters. ... We use LAMB with a simple learning rate schedule consisting of a flat learning rate of 2 × 10⁻³ for 55 epochs, after which the learning rate is decayed to 0 over the final 55 epochs following a cosine decay schedule (Loshchilov & Hutter, 2017). We use a batch size of 1024 and 64 TPUs.