Perceiver IO: A General Architecture for Structured Inputs & Outputs
Authors: Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. ... To probe the generality of Perceiver IO, we evaluate it on several domains including language understanding (Wikipedia+C4 masked language modeling), visual understanding (Sintel/KITTI optical flow and ImageNet classification), multi-modal (Kinetics autoencoding and AudioSet classification) & multi-task settings (multi-task GLUE), and symbolic representations for games (StarCraft II). |
| Researcher Affiliation | Industry | Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira ... All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020). |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the Perceiver IO code. It mentions using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020), which are tools used, not the code for their specific method. |
| Open Datasets | Yes | language understanding (Wikipedia+C4 masked language modeling), visual understanding (Sintel/KITTI optical flow and ImageNet classification), multi-modal (Kinetics autoencoding and AudioSet classification) & multi-task settings (multi-task GLUE), and symbolic representations for games (StarCraft II). ... We pretrain on the Masked Language Modeling (MLM) task proposed in Devlin et al. (2019) using a large text corpus obtained by combining English Wikipedia and C4 (Raffel et al., 2020). |
| Dataset Splits | Yes | We finetune Perceiver IO on the GLUE Benchmark Wang et al. (2019), reporting the best performance on the dev set for a fixed size sweep of finetuning hyperparameters. |
| Hardware Specification | Yes | All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020). ... We use a batch size of 1024 and 64 TPUs. ... Our most expensive model achieves approximately 0.8 frames/sec on a 2017 TITAN Xp, and our lightweight model (with conv downsampling and RAFT-style upsampling) achieves 3.3 frames/sec... On the publicly-available TPU v3, however, our most expensive model achieves 4.4 frames/sec on a single TPU core, and 17.8 frames/sec for the lightweight model. |
| Software Dependencies | No | All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020). ... An efficient Tensorflow implementation of RAFT (Sun et al., 2020) (received courtesy of the authors) achieves only 1.6 frames/sec on the same hardware. The paper mentions software like JAX and TensorFlow but does not provide specific version numbers. |
| Experiment Setup | Yes | We finetune Perceiver IO on the GLUE Benchmark Wang et al. (2019), reporting the best performance on the dev set for a fixed size sweep of finetuning hyperparameters. ... We use LAMB with a simple learning rate schedule consisting of a flat learning rate of 2 × 10⁻³ for 55 epochs, after which the learning rate is decayed to 0 over the final 55 epochs following a cosine decay schedule (Loshchilov & Hutter, 2017). We use a batch size of 1024 and 64 TPUs. |
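The learning rate schedule quoted in the Experiment Setup row (flat at 2 × 10⁻³ for 55 epochs, then cosine decay to 0 over the final 55 epochs) can be sketched as follows. This is a hypothetical helper written for illustration, not the authors' code; it assumes epoch-level granularity and omits the LAMB optimizer itself.

```python
import math

# Constants taken from the quoted setup; the function shape is an assumption.
BASE_LR = 2e-3      # flat learning rate for the first phase
FLAT_EPOCHS = 55    # epochs at the flat rate
DECAY_EPOCHS = 55   # epochs over which the rate decays to 0

def learning_rate(epoch: float) -> float:
    """Return the learning rate at a given (possibly fractional) epoch."""
    if epoch < FLAT_EPOCHS:
        return BASE_LR
    # Cosine decay from BASE_LR down to 0 over DECAY_EPOCHS, clamped at the end.
    progress = min((epoch - FLAT_EPOCHS) / DECAY_EPOCHS, 1.0)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at the start, the decay midpoint, and the final epoch.
print(learning_rate(0), learning_rate(82.5), learning_rate(110))
```

In practice this would typically be expressed with a schedule utility from the training framework (e.g. a constant schedule joined to a cosine-decay schedule), but the closed-form version above makes the quoted description concrete.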