Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models
Authors: Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we introduce the datasets and evaluation on zero-shot, base-to-new and cross-dataset generalization. All experiments are conducted on 11 diverse datasets, and the quantitative evaluation metric is classification accuracy. We also provide additional experimental results in the Appendix A section. |
| Researcher Affiliation | Academia | 1National University of Singapore 2Tsinghua University 3University of Science and Technology of China. Correspondence to: Xinchao Wang <xinchao@nus.edu.sg>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/zhiheLu/Ensemble_VLM.git. |
| Open Datasets | Yes | Following prior research (Zhou et al., 2022b;a; Khattak et al., 2023a;b), we employ a set of 11 diverse datasets, covering a large range of recognition tasks. Specifically, the benchmark comprises the following datasets: (i) ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004) for generic object classification; (ii) OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013) for fine-grained classification; (iii) SUN397 (Xiao et al., 2010) for scene recognition; (iv) UCF101 (Soomro et al., 2012) for action recognition; (v) DTD (Cimpoi et al., 2014) for texture classification; (vi) EuroSAT (Helber et al., 2019) for satellite imagery recognition. |
| Dataset Splits | No | The paper describes training on a 16-shot training set from base classes and evaluating on base and new classes from the test set, but it does not explicitly mention a separate validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions models like CLIP and ALIGN but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the tuning ensemble, we set the initial learning rate to 5e-3 and utilize the same adjusting scheduler as in (Zhou et al., 2022a; Khattak et al., 2023a;b). The sample-aware weight generator is a two-layer MLP (fdim → fdim/32 and fdim/32 → num_weight), which is trained for 5 epochs with a batch size of 128. (A minimal sketch of this setup appears below the table.) |
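The quoted setup pins down the weight generator's shape (fdim → fdim/32 → num_weight) and its training hyperparameters (lr 5e-3, 5 epochs, batch size 128), but not the implementation details. The sketch below is a minimal PyTorch rendition under stated assumptions: a CLIP ViT-B/16 feature dimension of 512, two ensembled models, SGD with the cosine schedule used in the cited CoOp/MaPLe recipes, and a softmax over the generated weights. None of these extra choices come from the paper, and this is not the authors' released code.

```python
# Minimal sketch of a sample-aware weight generator as described in the table above.
# Names (WeightGenerator, feat_dim, num_weights) and the softmax normalization are
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Two-layer MLP: feat_dim -> feat_dim // 32 -> num_weights."""
    def __init__(self, feat_dim: int, num_weights: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 32),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 32, num_weights),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # One weight vector per sample, normalized over the ensembled models (assumed).
        return self.net(image_features).softmax(dim=-1)

# Hyperparameters quoted in the table: lr 5e-3, 5 epochs, batch size 128.
# feat_dim=512 (CLIP ViT-B/16) and num_weights=2 are assumptions; the cosine
# schedule mirrors the CoOp/MaPLe recipe the paper cites.
feat_dim, num_weights = 512, 2
generator = WeightGenerator(feat_dim, num_weights)
optimizer = torch.optim.SGD(generator.parameters(), lr=5e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
```

In use, the per-sample weights produced by such a generator would presumably combine the predictions of the individual tuned vision-language models before the classification loss, which is the role the paper assigns to its sample-aware ensemble.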