Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

Authors: Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we introduce the datasets and the evaluation on zero-shot, base-to-new, and cross-dataset generalization. All experiments are conducted on 11 diverse datasets, and the quantitative evaluation metric is classification accuracy. We also provide additional experimental results in Appendix A.
Researcher Affiliation | Academia | 1 National University of Singapore, 2 Tsinghua University, 3 University of Science and Technology of China. Correspondence to: Xinchao Wang <xinchao@nus.edu.sg>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
Open Datasets | Yes | Following prior research (Zhou et al., 2022b;a; Khattak et al., 2023a;b), we employ a set of 11 diverse datasets, covering a large range of recognition tasks. Specifically, the benchmark comprises the following datasets: (i) ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004) for generic object classification; (ii) OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013) for fine-grained classification; (iii) SUN397 (Xiao et al., 2010) for scene recognition; (iv) UCF101 (Soomro et al., 2012) for action recognition; (v) DTD (Cimpoi et al., 2014) for texture classification; (vi) EuroSAT (Helber et al., 2019) for satellite imagery recognition. (A benchmark-list sketch appears after this table.)
Dataset Splits | No | The paper describes training on a 16-shot training set drawn from the base classes and evaluating on the base and new classes of the test set, but it does not explicitly mention a separate validation split for hyperparameter tuning or early stopping. (A split-and-sampling sketch appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions models like CLIP and ALIGN but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the tuning ensemble, we set the initial learning rate to 5e-3 and utilize the same adjusting scheduler as in (Zhou et al., 2022a; Khattak et al., 2023a;b). The sample-aware weight generator is a two-layer MLP (f_dim → f_dim/32 and f_dim/32 → num_weight), which is trained for 5 epochs with a batch size of 128. (A weight-generator sketch appears after this table.)
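
The 11-dataset benchmark quoted in the "Open Datasets" row can be summarized as a small configuration block. The sketch below is illustrative only: the identifier strings follow the common CoOp-style naming and are assumptions, not taken verbatim from the authors' code.

```python
# A minimal sketch of the 11-dataset benchmark, grouped by the task
# categories named in the paper. Dataset identifiers are assumed
# CoOp-style names, not verbatim from the released code.
BENCHMARK_DATASETS = {
    "generic_objects": ["ImageNet", "Caltech101"],
    "fine_grained":    ["OxfordPets", "StanfordCars", "Flowers102",
                        "Food101", "FGVCAircraft"],
    "scene":           ["SUN397"],
    "action":          ["UCF101"],
    "texture":         ["DTD"],
    "satellite":       ["EuroSAT"],
}

# Sanity check: the benchmark covers exactly 11 datasets.
assert sum(len(v) for v in BENCHMARK_DATASETS.values()) == 11
```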
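
For the "Dataset Splits" row, the base-to-new protocol can be made concrete with a short sketch. It assumes the usual CoOp/CoCoOp convention of splitting each dataset's classes into base and new halves and sampling 16 training images per base class; the function names and the (image_path, class_name) item format are hypothetical.

```python
import random
from collections import defaultdict


def base_new_split(class_names):
    """Split a dataset's classes into 'base' and 'new' halves
    (assumed CoOp/CoCoOp convention for base-to-new generalization)."""
    half = (len(class_names) + 1) // 2
    return class_names[:half], class_names[half:]


def sample_few_shot(train_items, base_classes, shots=16, seed=0):
    """Keep at most `shots` training images per base class.

    `train_items` is assumed to be a list of (image_path, class_name)
    pairs; only base-class images enter the 16-shot training set.
    """
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for path, cls in train_items:
        if cls in base_classes:
            per_class[cls].append(path)
    return {cls: rng.sample(paths, min(shots, len(paths)))
            for cls, paths in per_class.items()}
```

No validation split is constructed here, matching the observation that the paper does not mention one.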
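
The "Experiment Setup" row describes the sample-aware weight generator as a two-layer MLP (f_dim → f_dim/32 → num_weight) trained for 5 epochs with batch size 128 and an initial learning rate of 5e-3. Below is a minimal PyTorch sketch under those quoted numbers; the ReLU activation, the softmax over ensemble weights, the SGD optimizer, and the example dimensions (f_dim=512, num_weight=3) are assumptions not stated in the quote.

```python
import torch
import torch.nn as nn


class SampleAwareWeightGenerator(nn.Module):
    """Two-layer MLP mapping an image feature of size f_dim to num_weight
    per-sample ensemble weights (f_dim -> f_dim/32 -> num_weight)."""

    def __init__(self, f_dim: int, num_weight: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(f_dim, f_dim // 32),
            nn.ReLU(inplace=True),               # activation is an assumption
            nn.Linear(f_dim // 32, num_weight),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, f_dim) -> weights: (batch, num_weight).
        # Normalizing the weights with a softmax is an assumption.
        return self.mlp(image_features).softmax(dim=-1)


# Hypothetical training configuration mirroring the quoted numbers:
# initial learning rate 5e-3, 5 epochs, batch size 128. The learning-rate
# scheduler borrowed from Zhou et al. / Khattak et al. is omitted here.
generator = SampleAwareWeightGenerator(f_dim=512, num_weight=3)
optimizer = torch.optim.SGD(generator.parameters(), lr=5e-3)
```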