Perceptual Group Tokenizer: Building Perception with Iterative Grouping
Authors: Zhiwei Deng, Ting Chen, Yang Li
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the representation learned by our model on standard benchmarks based on the ImageNet-1K dataset. We also explore and analyze the design space of the perceptual group tokenizer in section 4.2, investigate its adaptive computation ability in section 4.3, demonstrate its generalization ability on semantic segmentation in section 4.4, and visualize learned attentions in section 4.5. |
| Researcher Affiliation | Industry | Zhiwei Deng, Ting Chen, and Yang Li, Google Research and DeepMind |
| Pseudocode | Yes | Algorithm 1 Multi-grouping operation using G. |
| Open Source Code | No | The paper does not include any explicit statement about releasing the source code or a direct link to a code repository for the described methodology. |
| Open Datasets | Yes | The widely-adopted standard benchmark for evaluating self-supervised learning methods is ImageNet ILSVRC-2012 (ImageNet-1K) (Russakovsky et al., 2015). |
| Dataset Splits | No | The paper mentions evaluating on ImageNet-1K with linear probe evaluation protocol but does not explicitly provide specific training/validation/test dataset split percentages or sample counts for reproduction. |
| Hardware Specification | Yes | The model is optimized using AdamW (Loshchilov & Hutter, 2018) with learning rate 0.0005 and 1024 batch size for 600 epochs, trained with TPUv5 for 21k core-hours (512 cores for 41 hours). |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The model is optimized using AdamW (Loshchilov & Hutter, 2018) with learning rate 0.0005 and 1024 batch size for 600 epochs, trained with TPUv5 for 21k core-hours (512 cores for 41 hours). We use 4×4 patches as image tokens. Three grouping blocks are used, with 10 grouping layers in each block. The dimension for the input token is 384, with 256 group tokens per layer. The dimensions for group tokens are 98, 192, and 288 for the three blocks, respectively. There are 6 grouping heads used. For the number of grouping iterations, we observe three rounds are sufficient to achieve good performance. The MLP hidden size for each layer is 384 as well, i.e., the MLP multiplication factor is 1. The final multi-head attention layer uses a learnable token with 2048 dimensions to summarize all group token outputs from the model. |
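The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch for anyone attempting a reproduction. This is a hypothetical structure assembled from the quoted text only; the field names are illustrative assumptions, not the authors' actual code, which (per the Open Source Code row) was not released.

```python
# Hypothetical reproduction config for the Perceptual Group Tokenizer,
# assembled from the hyperparameters quoted in the paper's setup section.
# Field names are illustrative; no official code release exists.
pgt_config = {
    "optimizer": "AdamW",
    "learning_rate": 5e-4,
    "batch_size": 1024,
    "epochs": 600,
    "patch_size": (4, 4),            # image tokens are 4x4 patches
    "num_grouping_blocks": 3,
    "grouping_layers_per_block": 10,
    "input_token_dim": 384,
    "group_tokens_per_layer": 256,
    "group_token_dims": [98, 192, 288],  # one per grouping block, as quoted
    "num_grouping_heads": 6,
    "grouping_iterations": 3,        # three rounds reported as sufficient
    "mlp_hidden_dim": 384,           # MLP multiplication factor of 1
    "summary_token_dim": 2048,       # learnable token in final attention layer
}

def tpu_core_hours(cores: int, hours: float) -> float:
    """Sanity-check the reported compute budget as cores * hours."""
    return cores * hours

# 512 cores for 41 hours gives 20992, matching the reported ~21k core-hours.
print(tpu_core_hours(512, 41))
```

The core-hour check is a useful cross-validation when reading hardware rows like the one above: the two reported figures (512 cores, 41 hours) should multiply out to the headline budget, and here they do.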