Perceiver: General Perception with Iterative Attention

Authors: Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, Joao Carreira

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train the Perceiver architecture on images from ImageNet (Deng et al., 2009) (left), video and audio from AudioSet (Gemmeke et al., 2017) (considered both multi- and uni-modally) (center), and 3D point clouds from ModelNet40 (Wu et al., 2015) (right). Essentially no architectural changes are required to use the model on a diverse range of input data.
Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Andrew Jaegle <drewjaegle@deepmind.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We train the Perceiver architecture on images from ImageNet (Deng et al., 2009) (left), video and audio from AudioSet (Gemmeke et al., 2017) (considered both multi- and uni-modally) (center), and 3D point clouds from ModelNet40 (Wu et al., 2015) (right).
Dataset Splits | Yes | As is standard practice, we evaluate our model and all baselines using the top-1 accuracy on the held-out validation set (the test set is not publicly available).
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models or processor types. It only mentions the software framework: 'All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020)'.
Software Dependencies | No | The paper states 'All experiments were conducted using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020)'. While these are specific frameworks, no version numbers are given, so the exact software environment needed for replication cannot be reconstructed.
Experiment Setup | Yes | We trained models for 120 epochs with an initial learning rate of 0.004, decaying it by a factor of 10 at [84, 102, 114] epochs. The best-performing Perceiver we identified on ImageNet attends to the input image 8 times, each time processing the full 50,176-pixel input array using a cross-attend module and a latent Transformer with 6 blocks and one cross-attend module with a single head per block. We used a latent array with 512 indices and 1024 channels, and position encodings generated with 64 bands and a maximum resolution of 224 pixels. On ImageNet, we found that models of this size overfit without weight sharing, so we use a model that shares weights for all but the first cross-attend and latent Transformer modules. The resulting model has 45 million parameters, making it comparable in size to convolutional models used on ImageNet.
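
The quoted setup pins down the model's control flow: a 512 x 1024 latent array, eight cross-attends over the full 50,176-element input array, a 6-block single-head latent Transformer after each cross-attend, and weight sharing for all but the first cross-attend and latent Transformer. The JAX sketch below is a minimal reconstruction of that loop, not the authors' released implementation: LayerNorm, MLP sublayers, and the classification head are omitted, and all function names, initializations, and toy dimensions are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def init_attn(key, c_q, c_kv, c_out):
    """Random weights for one single-head attention module (illustrative scale)."""
    kq, kk, kv = jax.random.split(key, 3)
    scale = 0.02
    return {"wq": scale * jax.random.normal(kq, (c_q, c_out)),
            "wk": scale * jax.random.normal(kk, (c_kv, c_out)),
            "wv": scale * jax.random.normal(kv, (c_kv, c_out))}

def attend(q_in, kv_in, p):
    """Single-head scaled dot-product attention (LayerNorm/MLP omitted)."""
    q, k, v = q_in @ p["wq"], kv_in @ p["wk"], kv_in @ p["wv"]
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def perceiver(inputs, latents, first, shared, num_cross_attends=8):
    """Latents repeatedly cross-attend to the full input array; weights are
    shared for all but the first cross-attend + latent Transformer."""
    z = latents
    for i in range(num_cross_attends):
        p = first if i == 0 else shared           # weight sharing after iteration 0
        z = z + attend(z, inputs, p["cross"])     # cross-attend: latents query inputs
        for blk in p["latent"]:                   # 6-block latent Transformer
            z = z + attend(z, z, blk)             # latent self-attention
    return z.mean(axis=0)                         # pooled features for a classifier

# Toy usage with shrunken dimensions (paper: 50,176 inputs, 512 x 1024 latents).
c_in, n_lat, c_lat = 32, 16, 64
inputs = jax.random.normal(jax.random.PRNGKey(0), (100, c_in))
latents = jax.random.normal(jax.random.PRNGKey(1), (n_lat, c_lat))

def make_params(key):
    keys = jax.random.split(key, 7)
    return {"cross": init_attn(keys[0], c_lat, c_in, c_lat),
            "latent": [init_attn(k, c_lat, c_lat, c_lat) for k in keys[1:]]}

first, shared = make_params(jax.random.PRNGKey(2)), make_params(jax.random.PRNGKey(3))
print(perceiver(inputs, latents, first, shared).shape)  # (64,)
```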
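The 'position encodings generated with 64 bands and a maximum resolution of 224 pixels' refer to the paper's Fourier-feature encodings: sine and cosine features at frequencies spaced up to half the maximum resolution (the Nyquist frequency), concatenated with the raw positions. The sketch below assumes that construction; the function name and the linear frequency spacing starting at 1 follow our reading of the original paper rather than anything quoted in this report.

```python
import jax.numpy as jnp

def fourier_position_encoding(positions, num_bands=64, max_resolution=224):
    """Assumed Fourier-feature encoding: positions is [n, d] in [-1, 1];
    returns [n, d * (2 * num_bands + 1)] features (sin, cos, raw position)."""
    # Frequencies linearly spaced from 1 to the Nyquist frequency (max_res / 2).
    freqs = jnp.linspace(1.0, max_resolution / 2.0, num_bands)       # [b]
    scaled = jnp.pi * positions[..., None] * freqs                   # [n, d, b]
    features = jnp.concatenate(
        [jnp.sin(scaled), jnp.cos(scaled), positions[..., None]], axis=-1
    )                                                                # [n, d, 2b+1]
    return features.reshape(positions.shape[0], -1)

# Example: encode a 224 x 224 pixel grid, coordinates scaled to [-1, 1].
xs = jnp.linspace(-1.0, 1.0, 224)
grid = jnp.stack(jnp.meshgrid(xs, xs, indexing="ij"), axis=-1).reshape(-1, 2)
enc = fourier_position_encoding(grid)
print(enc.shape)  # (50176, 258) -- matches the quoted 50,176-pixel input array
```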
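The quoted learning-rate schedule (initial 0.004, decayed by a factor of 10 at epochs 84, 102, and 114 over 120 epochs) maps directly onto a piecewise-constant schedule, shown here with optax from the DeepMind JAX ecosystem the paper cites. The steps_per_epoch value is an assumed placeholder; the report quotes neither a batch size nor an optimizer.

```python
import optax

steps_per_epoch = 1251  # assumption: ~1.28M ImageNet images at batch size 1024

# Initial LR 0.004, multiplied by 0.1 after the epoch-[84, 102, 114] boundaries.
lr_schedule = optax.piecewise_constant_schedule(
    init_value=4e-3,
    boundaries_and_scales={
        84 * steps_per_epoch: 0.1,
        102 * steps_per_epoch: 0.1,
        114 * steps_per_epoch: 0.1,
    },
)

for epoch in (0, 90, 110, 119):
    print(epoch, float(lr_schedule(epoch * steps_per_epoch)))
# 0: 0.004, 90: 0.0004, 110: 4e-05, 119: 4e-06
```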