Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards
Authors: Xiaoyu Yang, Jie Lu, En Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate our method enhances the efficiency and accuracy of image-text alignment in the pre-training of VL models, particularly in the concept drift scenario. Moreover, various downstream tasks exhibit significant improvements in our model's ability to adapt to the long-tailed open world. Furthermore, we create a set of multi-modal datasets called OpenMMlo, specifically tailored for the long-tailed open-world setting, to validate our findings. |
| Researcher Affiliation | Academia | Xiaoyu Yang, Jie Lu, En Yu. Australian Artificial Intelligence Institute (AAII), Faculty of Engineering and Information Technology, University of Technology Sydney, Australia. EMAIL; EMAIL |
| Pseudocode | No | The paper describes its methodology in text and mathematical formulations (Section 2 Methodology, Section 2.1 MULTI-MODAL CONCEPT DRIFT THEORY, Section 2.2 T-DISTRIBUTED ADAPTER FOR CONCEPT DRIFT, Section 2.3 T-DISTRIBUTED VISION LANGUAGE MODEL FOR THE CONCEPT DRIFT) and includes a workflow diagram (Figure 2), but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To foster the development of the multi-modal community, we have made both OpenMMlo datasets and our code publicly available at: https://github.com/XiaoyuYoung/ConceptDriftMLLMs. |
| Open Datasets | Yes | Furthermore, we create a set of multi-modal datasets called OpenMMlo, specifically tailored for the long-tailed open-world setting, to validate our findings. To foster the development of the multi-modal community, we have made both OpenMMlo datasets and our code publicly available at: https://github.com/XiaoyuYoung/ConceptDriftMLLMs. We extend the open-source datasets, namely ImageNet-LT Liu et al. (2019), iNaturalist2018 Van Horn et al. (2018) and Places-LT Liu et al. (2019). |
| Dataset Splits | Yes | The categories are split into three groups: many-shot (with more than 100 training samples), medium-shot (with 20-100 training samples), and few-shot (with fewer than 20 training samples). The Top-1 accuracies are computed for each group to evaluate the performance of mitigating the bias introduced by the long-tail distribution, respectively. ImageNet-LT has 1,000 classes and contains 115.8k samples, with a maximum of 1,280 samples and a minimum of 5 samples for a category. Besides, it consists of 18k images for OOD detection. |
| Hardware Specification | Yes | The pre-training of our vision language model consists of 800,000 steps, executed on 2 × 2 NVIDIA A100 GPUs. |
| Software Dependencies | No | For our language-guided image tokenizer, we leverage the strengths of both BERT Devlin et al. (2019b) and ViT as our text encoder, text decoder and visual encoder, respectively. We utilize the AdamW optimizer... The paper mentions specific software components like BERT, ViT, and the AdamW optimizer, but does not provide specific version numbers for any of them or for supporting libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | Table 7: The training hyperparameters of our vision language model. Pre-training: Training Steps 400,000, Warmup Steps 1,000, Optimizer AdamW, Learning Rate 1e-4, Learning Rate Decay Cosine, Adam β (0.9, 0.98), Weight Decay 0.05, Batch Size 50. Fine-tuning: Training Steps 18,000, Warmup Steps 0, Optimizer AdamW, Learning Rate 2e-5, Learning Rate Decay Cosine, Adam β (0.9, 0.98), Weight Decay 0.05, Batch Size 400. |
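For readers reimplementing the reported setup, the warmup-plus-cosine learning-rate schedule in Table 7 can be sketched as below. This is a minimal sketch, not the authors' code: the function name `lr_at_step` and the assumption that the cosine decay runs to a floor of zero over the remaining steps are ours, since the paper's quoted excerpts do not specify a minimum learning rate.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero.

    Parameter values mirror the reported pre-training schedule
    (peak_lr=1e-4, warmup_steps=1000, total_steps=400000); the
    zero decay floor is an assumption, not stated in the paper.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over warmup_steps.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps: progress goes 0 -> 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For the fine-tuning stage (warmup steps 0, peak learning rate 2e-5, 18,000 steps), the same function applies with those values substituted.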