Perceptual Group Tokenizer: Building Perception with Iterative Grouping

Authors: Zhiwei Deng, Ting Chen, Yang Li

ICLR 2024

Reproducibility Variables (result and supporting LLM response for each):
Research Type: Experimental
"We evaluate the representation learned by our model on standard benchmarks based on the ImageNet-1K dataset. We also explore and analyze the design space of perceptual group tokenizer in section 4.2, investigate its adaptive computation ability in section 4.3, demonstrate its generalization ability on semantic segmentation in section 4.4, and visualize learned attentions in section 4.5."
Researcher Affiliation: Industry
"Zhiwei Deng, Ting Chen, and Yang Li (Google Research and DeepMind)"
Pseudocode: Yes
"Algorithm 1: Multi-grouping operation using G."
Open Source Code: No
The paper contains no explicit statement about releasing the source code, nor a link to a code repository for the described methodology.
Open Datasets: Yes
"The widely-adopted standard benchmark for evaluating self-supervised learning methods is ImageNet ILSVRC-2012 (ImageNet-1K) (Russakovsky et al., 2015)."
Dataset Splits: No
The paper evaluates on ImageNet-1K under the linear-probe protocol but does not explicitly give train/validation/test split percentages or sample counts needed for reproduction.
Hardware Specification: Yes
"The model is optimized using AdamW (Loshchilov & Hutter, 2018) with learning rate 0.0005 and 1024 batch size for 600 epochs, trained with TPUv5 for 21k core hrs (512 cores for 41 hrs)."
Software Dependencies: No
The paper names the AdamW optimizer but provides no version numbers for any software dependencies or libraries.
Experiment Setup: Yes
"The model is optimized using AdamW (Loshchilov & Hutter, 2018) with learning rate 0.0005 and 1024 batch size for 600 epochs, trained with TPUv5 for 21k core hrs (512 cores for 41 hrs). We use 4×4 patches as image tokens. Three grouping blocks are used, with 10 grouping layers in each block. The dimension of input tokens is 384, with 256 group tokens per layer. The dimensions of group tokens are 98, 192, and 288 for the three blocks, respectively. There are 6 grouping heads. For the number of grouping iterations, we observe that three rounds are sufficient to achieve good performance. The MLP hidden size for each layer is also 384, i.e., the MLP multiplication factor is 1. The final multi-head attention layer uses a learnable token with 2048 dimensions to summarize all group token outputs from the model."
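To make the quoted setup easier to scan, the hyperparameters can be collected into a single configuration object. The sketch below is purely illustrative: the paper releases no code, so the `PGTConfig` class and all of its field names are hypothetical, not the authors' API — only the numeric values come from the quoted passage.

```python
# Hypothetical configuration sketch of the reported PGT training setup.
# All identifiers are illustrative; values are quoted from the paper.
from dataclasses import dataclass


@dataclass
class PGTConfig:
    # Tokenization and architecture
    patch_size: int = 4                        # 4x4 image patches as tokens
    input_token_dim: int = 384                 # input token dimension
    num_grouping_blocks: int = 3               # grouping blocks
    layers_per_block: int = 10                 # grouping layers per block
    group_tokens_per_layer: int = 256
    group_token_dims: tuple = (98, 192, 288)   # one dimension per block
    num_grouping_heads: int = 6
    grouping_iterations: int = 3               # three rounds reported as sufficient
    mlp_hidden_dim: int = 384                  # MLP multiplication factor of 1
    summary_token_dim: int = 2048              # learnable summary token dimension
    # Optimization (AdamW)
    learning_rate: float = 5e-4
    batch_size: int = 1024
    epochs: int = 600


cfg = PGTConfig()
# Reported compute budget: 512 TPUv5 cores for 41 hours.
core_hours = 512 * 41  # 20,992 core-hours, consistent with the quoted ~21k
```

Note that the 512-core × 41-hour figure multiplies out to 20,992 core-hours, matching the "21k core hrs" the report quotes.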