Perceptual Group Tokenizer: Building Perception with Iterative Grouping

Authors: Zhiwei Deng, Ting Chen, Yang Li

ICLR 2024

Reproducibility Variables (result and supporting LLM response for each):
Research Type: Experimental
"We evaluate the representation learned by our model on standard benchmarks based on the ImageNet-1K dataset. We also explore and analyze the design space of perceptual group tokenizer in section 4.2, investigate its adaptive computation ability in section 4.3, demonstrate its generalization ability on semantic segmentation in section 4.4, and visualize learned attentions in section 4.5."
Researcher Affiliation: Industry
"Zhiwei Deng, Ting Chen, and Yang Li (Google Research and DeepMind)"
Pseudocode: Yes
"Algorithm 1: Multi-grouping operation using G."
Open Source Code: No
The paper contains no explicit statement about releasing the source code, nor a link to a code repository for the described methodology.
Open Datasets: Yes
"The widely-adopted standard benchmark for evaluating self-supervised learning methods is ImageNet ILSVRC-2012 (ImageNet-1K) (Russakovsky et al., 2015)."
Dataset Splits: No
The paper evaluates on ImageNet-1K under the linear-probe protocol but does not explicitly give train/validation/test split percentages or sample counts needed for reproduction.
Hardware Specification: Yes
"The model is optimized using AdamW (Loshchilov & Hutter, 2018) with learning rate 0.0005 and 1024 batch size for 600 epochs, trained with TPUv5 for 21k core hrs (512 cores for 41 hrs)."
Software Dependencies: No
The paper names the AdamW optimizer but provides no version numbers for any software dependencies or libraries.
Experiment Setup: Yes
"The model is optimized using AdamW (Loshchilov & Hutter, 2018) with learning rate 0.0005 and 1024 batch size for 600 epochs, trained with TPUv5 for 21k core hrs (512 cores for 41 hrs). We use 4×4 patches as image tokens. Three grouping blocks are used, with 10 grouping layers in each block. The dimension of input tokens is 384, with 256 group tokens per layer. The dimensions of group tokens are 98, 192, and 288 for the three blocks, respectively. There are 6 grouping heads. For the number of grouping iterations, we observe that three rounds are sufficient to achieve good performance. The MLP hidden size for each layer is also 384, i.e., the MLP multiplication factor is 1. The final multi-head attention layer uses a learnable token with 2048 dimensions to summarize all group token outputs from the model."
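To make the quoted setup easier to scan, the hyperparameters can be collected into a single configuration object. The sketch below is purely illustrative: the paper releases no code, so the `PGTConfig` class and all of its field names are hypothetical, not the authors' API — only the numeric values come from the quoted passage.

```python
# Hypothetical configuration sketch of the reported PGT training setup.
# All identifiers are illustrative; values are quoted from the paper.
from dataclasses import dataclass


@dataclass
class PGTConfig:
    # Tokenization and architecture
    patch_size: int = 4                        # 4x4 image patches as tokens
    input_token_dim: int = 384                 # input token dimension
    num_grouping_blocks: int = 3               # grouping blocks
    layers_per_block: int = 10                 # grouping layers per block
    group_tokens_per_layer: int = 256
    group_token_dims: tuple = (98, 192, 288)   # one dimension per block
    num_grouping_heads: int = 6
    grouping_iterations: int = 3               # three rounds reported as sufficient
    mlp_hidden_dim: int = 384                  # MLP multiplication factor of 1
    summary_token_dim: int = 2048              # learnable summary token dimension
    # Optimization (AdamW)
    learning_rate: float = 5e-4
    batch_size: int = 1024
    epochs: int = 600


cfg = PGTConfig()
# Reported compute budget: 512 TPUv5 cores for 41 hours.
core_hours = 512 * 41  # 20,992 core-hours, consistent with the quoted ~21k
```

Note that the 512-core × 41-hour figure multiplies out to 20,992 core-hours, matching the "21k core hrs" the report quotes.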