On the Role of Discrete Tokenization in Visual Representation Learning

Authors: Tianqi Du, Yifei Wang, Yisen Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we first present the main empirical results of our proposed ClusterMIM methods on different real-world datasets with different backbones. Then we conduct a series of ablation experiments to discuss the selection of hyperparameters in ClusterMIM."
Researcher Affiliation | Academia | 1 National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2 School of Mathematical Sciences, Peking University; 3 Institute for Artificial Intelligence, Peking University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/PKU-ML/ClusterMIM.
Open Datasets | Yes | "extensive experiments are conducted on ImageNet-100 (Deng et al., 2009) and ImageNet-1K (Deng et al., 2009)."
Dataset Splits | No | The paper mentions conducting "linear evaluation and non-linear fine-tuning" on the pretrained encoder and reports "fine-tuning accuracies" and "linear probing accuracies." However, it does not explicitly state a training/validation/test split or describe how the data were partitioned for model development or hyperparameter tuning; it reports only final evaluation results.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments. It mentions training time but gives no hardware details.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | "The mask ratio is set to 0.75. On both datasets, we pretrain the model for 200 epochs with batch size 4096 and weight decay 0.05. For the K-Means algorithm used in the tokenizer pretraining stage, we use K-Means++ initialization (Arthur & Vassilvitskii, 2007). We train K-Means for 100 epochs on ImageNet-100 and 10 epochs on ImageNet-1K."
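To make the quoted setup concrete, below is a minimal sketch of the tokenizer-pretraining step it describes: fitting K-Means with K-Means++ initialization over patch features to obtain discrete token indices. The feature shapes, sample counts, vocabulary size, and the use of scikit-learn's MiniBatchKMeans are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): discrete tokenization via K-Means
# with K-Means++ initialization, as quoted in the Experiment Setup row.
# Feature dimensions, sample count, and vocabulary size are assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hyperparameters reported in the quoted setup (for the MIM pretraining stage).
MASK_RATIO = 0.75      # fraction of patches masked during pretraining
PRETRAIN_EPOCHS = 200  # MIM pretraining epochs
BATCH_SIZE = 4096      # MIM pretraining batch size
WEIGHT_DECAY = 0.05    # MIM pretraining weight decay

# Stand-in for patch features extracted from images (N patches, D dimensions).
rng = np.random.default_rng(0)
patch_features = rng.standard_normal((20_000, 768)).astype(np.float32)

# Tokenizer pretraining: K-Means with K-Means++ initialization
# (init="k-means++" is also scikit-learn's default). The vocabulary size
# below is a placeholder, not a value taken from the paper.
tokenizer = MiniBatchKMeans(
    n_clusters=512,
    init="k-means++",
    batch_size=BATCH_SIZE,
    n_init=1,
    random_state=0,
)
tokenizer.fit(patch_features)

# Each patch's discrete token is the index of its nearest centroid; in a
# ClusterMIM-style pipeline such tokens would serve as prediction targets
# for the masked patches.
tokens = tokenizer.predict(patch_features)
print(tokens.shape, tokens.min(), tokens.max())
```

In this sketch the cluster assignments play the role of the discrete tokenizer output; how the tokens are consumed by the masked-image-modeling objective follows the paper, not the code above.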