Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Enhancing Vision-Language Model with Unmasked Token Alignment
Authors: Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. |
| Researcher Affiliation | Collaboration | Jihao Liu (CUHK MMLab); Jinliang Zheng (Institute for AI Industry Research (AIR), Tsinghua University); Boxiao Liu (SenseTime Research); Yu Liu (SenseTime Research); Hongsheng Li (CUHK MMLab; CPII under InnoHK; Shanghai AI Laboratory) |
| Pseudocode | No | The paper describes the methodology in natural language text and provides diagrams (e.g., Figure 1) to illustrate the architecture and process, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the proposed Unmasked Token Alignment (UTA) method, nor does it provide any links to a code repository. |
| Open Datasets | Yes | All ViT models are pre-trained on the ImageNet-21K (Deng et al., 2009) dataset... Then we perform contrastive fine-tuning on the DataComp-1B dataset (Gadre et al., 2023). Table 2: Zero-shot retrieval performance on Flickr30k (Young et al., 2014) and COCO (Lin et al., 2014). On LLaVA-Bench, we follow the default settings to first train a projection layer on the CC-3M dataset (Sharma et al., 2018) for feature alignment and then fine-tune the projection layer and Large Language Model (LLM) (Chiang et al., 2023) on the LLaVA-Instruct-150K dataset (Liu et al., 2023). For object detection and instance segmentation tasks, we adopt the Cascade Mask R-CNN (He et al., 2017; Cai & Vasconcelos, 2019) framework and separately fine-tune on the COCO (Lin et al., 2014) and LVIS (Gupta et al., 2019) datasets. For the semantic segmentation task, we adopt the UperNet (Xiao et al., 2018) framework and fine-tune on the ADE20K (Zhou et al., 2017) dataset. |
| Dataset Splits | No | The paper mentions several datasets (e.g., ImageNet-21K, DataComp-1B, COCO, LVIS, ADE20K) and describes pre-training and fine-tuning on them, including input resolutions and batch sizes. However, it does not explicitly describe how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or citations to specific split protocols), instead relying on implied standard benchmark splits. |
| Hardware Specification | No | The paper states that training CLIP models from scratch "requires a lot of computing resources" and discusses "training FLOPs," but it does not specify any particular hardware components such as GPU models, CPU types, or cloud computing instances used for its experiments. |
| Software Dependencies | No | The paper mentions optimizers like "AdamW (Loshchilov & Hutter, 2017)" and "LAMB (You et al., 2019)" and various models/frameworks (Vision Transformer, CLIP, LLaVA, Cascade Mask R-CNN, UperNet), but it does not specify version numbers for any libraries, programming languages, or other software tools used to implement and run the experiments. |
| Experiment Setup | Yes | Pre-training. All ViT models are pre-trained on the ImageNet-21K (Deng et al., 2009) dataset using 224×224 input resolution and a patch size of 14. Unless otherwise specified, we pre-train for 150 epochs with a batch size of 4096. We use the AdamW (Loshchilov & Hutter, 2017) optimizer with weight decay of 0.05. The learning rate is linearly increased to 1.5×10⁻³ over 1 epoch of training and decays to 10⁻⁵ with a cosine schedule (Loshchilov & Hutter, 2016). By default, we use reversed block-wise masking with mask ratios of 0.4 and 0.5 for base and large models, respectively. Contrastive fine-tuning on DataComp-1B. We initialize the model with the pre-trained ViT encoder and CLIP text encoder and fix the temperature value in the CLIP loss to 0.01. We use a total batch size of 49,152 for fine-tuning. Following Sun et al. (2023), we use the LAMB (You et al., 2019) optimizer with peak learning rates of 2×10⁻⁴ and 4×10⁻⁴ for base and large models, respectively. We use a layer-wise learning rate for fine-tuning and set the decay rate to 0.75 and 0.85 for base and large models, respectively. The weight decay is set to 0.05 for all models. We use a cosine learning rate schedule and decay the learning rate to 0. Fine-tuning and evaluation with LLaVA. Stage 1: Feature alignment... We use the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 2×10⁻³. The learning rate is linearly warmed up for the first 150 iterations and decayed to 0 with a cosine schedule. We use a batch size of 128 and apply no weight decay. Stage 2: End-to-end fine-tuning... changing the batch size to 32 and setting the learning rate to 2×10⁻⁵. Object detection and segmentation. On COCO, we use a batch size of 128 and fine-tune for 60k iterations. We use learning rates of 5×10⁻⁵/6×10⁻⁵, drop path rates (Huang et al., 2016) of 0.1/0.4, and layer-wise decay rates of 0.7/0.8 for base/large models. On LVIS, we use a batch size of 64 and fine-tune for 100k iterations. The learning rate is set to 10⁻⁴. The drop path rate and layer-wise decay rate are the same as those used on COCO. We adopt the UperNet (Xiao et al., 2018) framework for semantic segmentation on ADE20K (Zhou et al., 2017). In particular, we use a batch size of 32 and fine-tune for 60k iterations. We use learning rates of 6×10⁻⁵/4×10⁻⁵, drop path rates of 0.15/0.2, and layer-wise decay rates of 0.85/0.9 for base/large models. |
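The pre-training schedule quoted in the Experiment Setup row (linear warmup to a peak learning rate of 1.5×10⁻³ over 1 epoch, then cosine decay to 10⁻⁵ over 150 epochs) can be sketched as below. This is a minimal illustration, not code from the paper; the function name and per-epoch step granularity are assumptions.

```python
import math

def pretrain_lr(epoch, total_epochs=150, warmup_epochs=1,
                peak_lr=1.5e-3, final_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    Hypothetical sketch of the schedule described in the paper's
    pre-training setup; epoch-level granularity is an assumption.
    """
    if epoch < warmup_epochs:
        # Linear warmup: reaches peak_lr at the end of warmup.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from peak_lr down to final_lr.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

The same shape (with a different peak and a floor of 0) applies to the LLaVA stage-1 schedule quoted above, which warms up for 150 iterations and decays with cosine annealing.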