Perceiver: General Perception with Iterative Attention

Authors: Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, Joao Carreira

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train the Perceiver architecture on images from ImageNet (Deng et al., 2009) (left), video and audio from AudioSet (Gemmeke et al., 2017) (considered both multi- and uni-modally) (center), and 3D point clouds from ModelNet40 (Wu et al., 2015) (right). Essentially no architectural changes are required to use the model on a diverse range of input data.
Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Andrew Jaegle <drewjaegle@deepmind.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We train the Perceiver architecture on images from ImageNet (Deng et al., 2009) (left), video and audio from AudioSet (Gemmeke et al., 2017) (considered both multi- and uni-modally) (center), and 3D point clouds from ModelNet40 (Wu et al., 2015) (right).
Dataset Splits | Yes | As is standard practice, we evaluate our model and all baselines using the top-1 accuracy on the held-out validation set (the test set is not publicly available).
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models or processor types. It only mentions the software framework: 'All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020)'.
Software Dependencies | No | The paper states 'All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020)'. While these are specific frameworks, no version numbers are given, so the exact software environment needed for replication cannot be reconstructed.
Experiment Setup | Yes | We trained models for 120 epochs with an initial learning rate of 0.004, decaying it by a factor of 10 at [84, 102, 114] epochs. The best-performing Perceiver we identified on ImageNet attends to the input image 8 times, each time processing the full 50,176-pixel input array using a cross-attend module and a latent Transformer with 6 blocks and one cross-attend module with a single head per block. We used a latent array with 512 indices and 1024 channels, and position encodings generated with 64 bands and a maximum resolution of 224 pixels. On ImageNet, we found that models of this size overfit without weight sharing, so we use a model that shares weights for all but the first cross-attend and latent Transformer modules. The resulting model has 45 million parameters, making it comparable in size to convolutional models used on ImageNet.
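
The quoted setup pins down the model's control flow: a 512 x 1024 latent array, eight cross-attends over the full 50,176-element input array, a 6-block single-head latent Transformer after each cross-attend, and weight sharing for all but the first cross-attend and latent Transformer. The JAX sketch below is a minimal reconstruction of that loop, not the authors' released implementation: LayerNorm, MLP sublayers, and the classification head are omitted, and all function names, initializations, and toy dimensions are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def init_attn(key, c_q, c_kv, c_out):
    """Random weights for one single-head attention module (illustrative scale)."""
    kq, kk, kv = jax.random.split(key, 3)
    scale = 0.02
    return {"wq": scale * jax.random.normal(kq, (c_q, c_out)),
            "wk": scale * jax.random.normal(kk, (c_kv, c_out)),
            "wv": scale * jax.random.normal(kv, (c_kv, c_out))}

def attend(q_in, kv_in, p):
    """Single-head scaled dot-product attention (LayerNorm/MLP omitted)."""
    q, k, v = q_in @ p["wq"], kv_in @ p["wk"], kv_in @ p["wv"]
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def perceiver(inputs, latents, first, shared, num_cross_attends=8):
    """Latents repeatedly cross-attend to the full input array; weights are
    shared for all but the first cross-attend + latent Transformer."""
    z = latents
    for i in range(num_cross_attends):
        p = first if i == 0 else shared           # weight sharing after iteration 0
        z = z + attend(z, inputs, p["cross"])     # cross-attend: latents query inputs
        for blk in p["latent"]:                   # 6-block latent Transformer
            z = z + attend(z, z, blk)             # latent self-attention
    return z.mean(axis=0)                         # pooled features for a classifier

# Toy usage with shrunken dimensions (paper: 50,176 inputs, 512 x 1024 latents).
c_in, n_lat, c_lat = 32, 16, 64
inputs = jax.random.normal(jax.random.PRNGKey(0), (100, c_in))
latents = jax.random.normal(jax.random.PRNGKey(1), (n_lat, c_lat))

def make_params(key):
    keys = jax.random.split(key, 7)
    return {"cross": init_attn(keys[0], c_lat, c_in, c_lat),
            "latent": [init_attn(k, c_lat, c_lat, c_lat) for k in keys[1:]]}

first, shared = make_params(jax.random.PRNGKey(2)), make_params(jax.random.PRNGKey(3))
print(perceiver(inputs, latents, first, shared).shape)  # (64,)
```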
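The 'position encodings generated with 64 bands and a maximum resolution of 224 pixels' refer to the paper's Fourier-feature encodings: sine and cosine features at frequencies spaced up to half the maximum resolution (the Nyquist frequency), concatenated with the raw positions. The sketch below assumes that construction; the function name and the linear frequency spacing starting at 1 follow our reading of the original paper rather than anything quoted in this report.

```python
import jax.numpy as jnp

def fourier_position_encoding(positions, num_bands=64, max_resolution=224):
    """Assumed Fourier-feature encoding: positions is [n, d] in [-1, 1];
    returns [n, d * (2 * num_bands + 1)] features (sin, cos, raw position)."""
    # Frequencies linearly spaced from 1 to the Nyquist frequency (max_res / 2).
    freqs = jnp.linspace(1.0, max_resolution / 2.0, num_bands)       # [b]
    scaled = jnp.pi * positions[..., None] * freqs                   # [n, d, b]
    features = jnp.concatenate(
        [jnp.sin(scaled), jnp.cos(scaled), positions[..., None]], axis=-1
    )                                                                # [n, d, 2b+1]
    return features.reshape(positions.shape[0], -1)

# Example: encode a 224 x 224 pixel grid, coordinates scaled to [-1, 1].
xs = jnp.linspace(-1.0, 1.0, 224)
grid = jnp.stack(jnp.meshgrid(xs, xs, indexing="ij"), axis=-1).reshape(-1, 2)
enc = fourier_position_encoding(grid)
print(enc.shape)  # (50176, 258) -- matches the quoted 50,176-pixel input array
```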
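The quoted learning-rate schedule (initial 0.004, decayed by a factor of 10 at epochs 84, 102, and 114 over 120 epochs) maps directly onto a piecewise-constant schedule, shown here with optax from the DeepMind JAX ecosystem the paper cites. The steps_per_epoch value is an assumed placeholder; the report quotes neither a batch size nor an optimizer.

```python
import optax

steps_per_epoch = 1251  # assumption: ~1.28M ImageNet images at batch size 1024

# Initial LR 0.004, multiplied by 0.1 after the epoch-[84, 102, 114] boundaries.
lr_schedule = optax.piecewise_constant_schedule(
    init_value=4e-3,
    boundaries_and_scales={
        84 * steps_per_epoch: 0.1,
        102 * steps_per_epoch: 0.1,
        114 * steps_per_epoch: 0.1,
    },
)

for epoch in (0, 90, 110, 119):
    print(epoch, float(lr_schedule(epoch * steps_per_epoch)))
# 0: 0.004, 90: 0.0004, 110: 4e-05, 119: 4e-06
```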