Text-To-Concept (and Back) via Cross-Model Alignment

Authors: Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, Soheil Feizi

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We observe that the mapping between an image's representation in one model and its representation in another can be learned surprisingly well with just a linear layer, even across diverse models. Building on this observation, we propose text-to-concept, where features from a fixed pretrained model are aligned linearly to the CLIP space, so that text embeddings from CLIP's text encoder become directly comparable to the aligned features. With text-to-concept, we convert fixed off-the-shelf vision encoders into surprisingly strong zero-shot classifiers for free, with accuracy at times even surpassing that of CLIP, despite being much smaller models trained on a small fraction of the data compared to CLIP. We show other immediate use-cases of text-to-concept, such as building concept bottleneck models with no concept supervision, diagnosing distribution shifts in terms of human concepts, and retrieving images satisfying a set of text-based constraints. Lastly, we demonstrate the feasibility of concept-to-text, where vectors in a model's feature space are decoded by first aligning them to the CLIP space before being fed to a GPT-based generative model.
Researcher Affiliation | Collaboration | Mazda Moayeri *1, Keivan Rezaei *1, Maziar Sanjabi 2, Soheil Feizi 1 ... 1 Department of Computer Science, University of Maryland; 2 Meta AI. Correspondence to: Mazda Moayeri <mmoayeri@umd.edu>, Keivan Rezaei <krezaei@umd.edu>.
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. Methods are described in prose and mathematical formulations.
Open Source Code | No | For almost all of these models, pretrained weights are obtained from the timm library (Wightman, 2019) and (Ilharco et al., 2021).
Open Datasets | Yes | Note that in our experiments, all models except the CLIP models are trained on the ImageNet-1K dataset (Deng et al., 2009), but we evaluate all of them on ImageNet. We use RIVAL10 classification (Moayeri et al., 2022) as an example of how a CBM can be implemented with no concept supervision using text-to-concept.
Dataset Splits | No | Note that we use the ImageNet-1K train and test sets as Dtrain and Dtest for the linear alignment.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or memory specifications) are mentioned for running the experiments.
Software Dependencies | Yes | To optimize Eq. A.1, we use the SGD optimizer and a learning rate scheduler (implemented in PyTorch (Paszke et al., 2019)) with the following hyperparameters: optimizer = optim.SGD(lr=0.01, momentum=0.9, weight_decay=5e-4); scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(T_max=200).
Experiment Setup | Yes | To optimize Eq. A.1, we use the SGD optimizer and a learning rate scheduler (implemented in PyTorch (Paszke et al., 2019)) with the following hyperparameters: optimizer = optim.SGD(lr=0.01, momentum=0.9, weight_decay=5e-4); scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(T_max=200). We run the optimization for 6 epochs.
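
The text-to-concept pipeline described in the abstract row above can be sketched in a few lines: a frozen encoder's features are mapped into CLIP space by a learned linear layer, after which CLIP text embeddings of class prompts act as zero-shot classifiers. This is a minimal illustration, not the authors' released code; the names `aligner`, `features`, and `text_embeddings` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(features, aligner, text_embeddings):
    """Zero-shot classification via linear alignment to CLIP space.

    features: (B, d_model) activations from a frozen, off-the-shelf vision encoder.
    aligner: nn.Linear(d_model, d_clip) trained to approximate CLIP image embeddings.
    text_embeddings: (C, d_clip) CLIP text-encoder embeddings of class prompts.
    """
    aligned = F.normalize(aligner(features), dim=-1)   # map encoder features into CLIP space
    text = F.normalize(text_embeddings, dim=-1)
    logits = aligned @ text.T                          # cosine similarity to each class prompt
    return logits.argmax(dim=-1)                       # predicted class index per image
```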
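The hyperparameters quoted in the Software Dependencies and Experiment Setup rows correspond to a training loop along the following lines. This is a sketch under the assumption that the alignment objective is a simple regression onto CLIP image embeddings (the paper's Eq. A.1 is not reproduced here); the feature dimensions and data are synthetic placeholders.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dimensions: 2048-d encoder features mapped to 512-d CLIP embeddings.
aligner = nn.Linear(2048, 512)
optimizer = optim.SGD(aligner.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

# Synthetic stand-in data: pairs of (frozen-encoder feature, target CLIP image embedding).
feats, clip_feats = torch.randn(1024, 2048), torch.randn(1024, 512)
loader = DataLoader(TensorDataset(feats, clip_feats), batch_size=256, shuffle=True)

for epoch in range(6):  # the paper reports 6 epochs of optimization
    for x, y in loader:
        loss = nn.functional.mse_loss(aligner(x), y)  # assumed alignment objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```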