Text-To-Concept (and Back) via Cross-Model Alignment
Authors: Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, Soheil Feizi
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe that the mapping between an image's representation in one model to its representation in another can be learned surprisingly well with just a linear layer, even across diverse models. Building on this observation, we propose text-to-concept, where features from a fixed pretrained model are aligned linearly to the CLIP space, so that text embeddings from CLIP's text encoder become directly comparable to the aligned features. With text-to-concept, we convert fixed off-the-shelf vision encoders to surprisingly strong zero-shot classifiers for free, with accuracy at times even surpassing that of CLIP, despite being much smaller models and trained on a small fraction of the data compared to CLIP. We show other immediate use-cases of text-to-concept, like building concept bottleneck models with no concept supervision, diagnosing distribution shifts in terms of human concepts, and retrieving images satisfying a set of text-based constraints. Lastly, we demonstrate the feasibility of concept-to-text, where vectors in a model's feature space are decoded by first aligning to the CLIP space before being fed to a GPT-based generative model. |
| Researcher Affiliation | Collaboration | Mazda Moayeri*1, Keivan Rezaei*1, Maziar Sanjabi2, Soheil Feizi1 ... 1Department of Computer Science, University of Maryland; 2Meta AI. Correspondence to: Mazda Moayeri <mmoayeri@umd.edu>, Keivan Rezaei <krezaei@umd.edu>. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. Methods are described in prose and mathematical formulations. |
| Open Source Code | No | For almost all of these models, pretrained weights are obtained from the timm library (Wightman, 2019) and (Ilharco et al., 2021). |
| Open Datasets | Yes | Note that in our experiments, all models except CLIPs are trained on the ImageNet-1K dataset (Deng et al., 2009) but we evaluate all of them with ImageNet. We use RIVAL10 classification (Moayeri et al., 2022) as an example for how a CBM can be implemented with no concept supervision using text-to-concept. |
| Dataset Splits | No | Note that we use the ImageNet-1K train and test datasets as D_train and D_test in linear alignment. |
| Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or memory specifications) are mentioned for running the experiments. |
| Software Dependencies | Yes | In terms of optimizing A.1, we use the SGD optimizer and a learning rate scheduler (implemented in Torch (Paszke et al., 2019)) with the following hyperparameters: `optimizer = optim.SGD(lr=0.01, momentum=0.9, weight_decay=5e-4)`; `scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(T_max=200)` |
| Experiment Setup | Yes | In terms of optimizing A.1, we use the SGD optimizer and a learning rate scheduler (implemented in Torch (Paszke et al., 2019)) with the following hyperparameters: `optimizer = optim.SGD(lr=0.01, momentum=0.9, weight_decay=5e-4)`; `scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(T_max=200)`. We run optimization for 6 epochs. |
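
The abstract quoted under Research Type describes text-to-concept: a linear layer maps a fixed vision encoder's features into CLIP space, where CLIP text embeddings of class names act as concept vectors for zero-shot classification. The sketch below is a minimal illustration of that idea, not the authors' released code; the `timm` backbone, the `open_clip` text encoder, and the untrained `aligner` layer are illustrative assumptions (in the paper the aligner is fit on ImageNet-1K features).

```python
# Minimal sketch of text-to-concept zero-shot classification (assumptions noted above).
import torch
import timm
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Off-the-shelf vision encoder used as a feature extractor only (assumed: ResNet-50 from timm).
backbone = timm.create_model("resnet50", pretrained=True, num_classes=0).eval().to(device)

# CLIP text encoder that embeds class names as "concept" vectors (assumed: ViT-B/32, OpenAI weights).
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
clip_model = clip_model.eval().to(device)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

feat_dim = backbone.num_features   # 2048 for ResNet-50
clip_dim = 512                     # embedding dim of CLIP ViT-B/32

# Hypothetical linear aligner W: backbone feature space -> CLIP image-embedding space.
# The paper fits this layer on ImageNet-1K; here it is only instantiated for illustration.
aligner = torch.nn.Linear(feat_dim, clip_dim).to(device)

@torch.no_grad()
def zero_shot_logits(images: torch.Tensor, class_names: list[str]) -> torch.Tensor:
    """Cosine similarity between aligned image features and class-name text embeddings."""
    prompts = tokenizer([f"a photo of a {c}" for c in class_names]).to(device)
    text_emb = clip_model.encode_text(prompts)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    feats = backbone(images.to(device))    # backbone features
    aligned = aligner(feats)               # map into CLIP space
    aligned = aligned / aligned.norm(dim=-1, keepdim=True)
    return aligned @ text_emb.T            # zero-shot class scores
```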
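
The Software Dependencies and Experiment Setup rows report the optimizer and scheduler settings used to fit the linear alignment (SGD with lr=0.01, momentum=0.9, weight_decay=5e-4; CosineAnnealingLR with T_max=200; 6 epochs). The sketch below wires those quoted settings into a runnable loop; the feature dimensions, the MSE objective, the random stand-in data, and the per-epoch scheduler step are assumptions for illustration and are not taken from the paper.

```python
# Sketch of the linear-alignment optimization using the hyperparameters quoted above.
import torch
import torch.nn as nn
import torch.optim as optim

feat_dim, clip_dim = 2048, 512            # assumed: ResNet-50 features -> CLIP ViT-B/32 space
aligner = nn.Linear(feat_dim, clip_dim)

# Stand-in for precomputed (backbone feature, CLIP image embedding) pairs;
# the paper uses ImageNet-1K train features here.
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(1024, feat_dim), torch.randn(1024, clip_dim)),
    batch_size=256, shuffle=True)

# Hyperparameters as reported in the table above.
optimizer = optim.SGD(aligner.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
loss_fn = nn.MSELoss()                    # assumed regression objective on CLIP embeddings

for epoch in range(6):                    # paper reports 6 epochs
    for backbone_feats, clip_img_embs in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(aligner(backbone_feats), clip_img_embs)
        loss.backward()
        optimizer.step()
    scheduler.step()                      # stepping granularity is an assumption
```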