Parts of Speech–Grounded Subspaces in Vision-Language Models
Authors: James Oldfield, Christos Tzelepis, Yannis Panagakis, Mihalis Nicolaou, Ioannis Patras
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification. |
| Researcher Affiliation | Academia | 1Queen Mary University of London 2National and Kapodistrian University of Athens 3Archimedes/Athena RC 4The Cyprus Institute |
| Pseudocode | No | The paper describes the mathematical formulation of the objective function and its closed-form solution but does not present any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the authors' source code for their methodology is released or provide a link to their own repository. Table 4 mentions a 'GitHub' link for 'OpenCLIP', but this refers to a tool used, not the code for the paper's specific methodology. |
| Open Datasets | Yes | Our labelled data points (with which we compute the closed-form solution of Equation (4)) for these parts of speech are given by the WordNet [16] database. We find the noun submanifold projection to lead to improved zero-shot classification on 14/15 of the datasets considered with CLIP ViT-B-32 (ImageNet, MIT-states, UT Zappos, DomainNet, Stanford Cars, Caltech101, Food101, CIFAR10, CIFAR100, Oxford Pets, Flowers102, Caltech256, STL10, MNIST, FER2013) |
| Dataset Splits | No | The paper mentions using various datasets for zero-shot classification but does not provide specific training/validation/test dataset splits or their percentages/counts. It refers to a 'baseline zero-shot classification protocol' which implies standard splits, but these are not detailed. |
| Hardware Specification | Yes | To run the Paella model, we use a 32GB NVIDIA Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions software such as CLIP, the Paella text-to-image model, and OpenCLIP but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For all experiments, we use the following 4 parts of speech: nouns, adjectives, verbs, and adverbs. We set λ := 1/2 for all experiments. For all quantitative results, we use the base CLIP ViT-B-32 model. For all results in this subsection, we project onto a relatively large k := 500 dimensional subspace for PCA, PGA, and the proposed method. (A hedged sketch of this subspace-projection setup follows the table.) |
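
The setup rows above describe projecting CLIP text embeddings onto a k = 500 dimensional noun subspace (learned from WordNet data) before zero-shot classification. The paper's own subspace comes from the closed-form solution of its Equation (4), which is not reproduced in this report; the sketch below is a minimal stand-in that substitutes plain PCA over CLIP embeddings of a sample of WordNet noun prompts and then reuses that projector in a standard CLIP zero-shot pipeline. The helper names (`encode_texts`, `project`, `zero_shot_logits`), the prompt template, and the 5,000-noun sample are illustrative assumptions, not from the paper.

```python
# Minimal sketch, NOT the authors' implementation: approximates the noun
# subspace with PCA over CLIP text embeddings of WordNet noun prompts,
# then classifies images against projected class prototypes.
# Assumes: pip install torch git+https://github.com/openai/CLIP nltk
# and nltk.download("wordnet") has been run.
import torch
import clip                              # OpenAI CLIP package
from nltk.corpus import wordnet as wn    # WordNet labelled data points

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def encode_texts(prompts, batch_size=256):
    """Encode and L2-normalise a list of prompts with the CLIP text tower."""
    feats = []
    for i in range(0, len(prompts), batch_size):
        tokens = clip.tokenize(prompts[i:i + batch_size]).to(device)
        f = model.encode_text(tokens).float()
        feats.append(f / f.norm(dim=-1, keepdim=True))
    return torch.cat(feats)

# Hypothetical stand-in for the part-of-speech labelled data: a sample of
# WordNet noun lemmas wrapped in a simple prompt template.
nouns = sorted({l.name().replace("_", " ")
                for s in wn.all_synsets("n") for l in s.lemmas()})[:5000]
noun_feats = encode_texts([f"a photo of a {n}" for n in nouns])

# k = 500 dimensional subspace (as in the paper's setup), here found by PCA
# rather than by the paper's closed-form solution to Equation (4).
k = 500
mean = noun_feats.mean(dim=0, keepdim=True)
_, _, Vt = torch.linalg.svd(noun_feats - mean, full_matrices=False)
basis = Vt[:k]                                            # (k, d) orthonormal rows

def project(x):
    """Rank-k projection of embeddings onto the (PCA) noun subspace."""
    return (x - mean) @ basis.T @ basis + mean

@torch.no_grad()
def zero_shot_logits(images, class_names):
    """Standard CLIP zero-shot logits, with projected class prototypes."""
    text = project(encode_texts([f"a photo of a {c}" for c in class_names]))
    text = text / text.norm(dim=-1, keepdim=True)
    img = model.encode_image(images.to(device)).float()
    img = img / img.norm(dim=-1, keepdim=True)
    return 100.0 * img @ text.T
```

Under this sketch, the only change to the baseline zero-shot protocol is the `project` step applied to the class-prompt embeddings; swapping PCA for the paper's closed-form subspace (or PGA) would reuse the same pipeline with a different `basis`.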