Parts of Speech–Grounded Subspaces in Vision-Language Models

Authors: James Oldfield, Christos Tzelepis, Yannis Panagakis, Mihalis Nicolaou, Ioannis Patras

NeurIPS 2023

Each entry below gives a reproducibility variable, its assessed result, and the LLM response supporting that assessment.
Research Type: Experimental. "We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification."
Researcher Affiliation: Academia. (1) Queen Mary University of London; (2) National and Kapodistrian University of Athens; (3) Archimedes/Athena RC; (4) The Cyprus Institute.
Pseudocode: No. The paper describes the mathematical formulation of the objective function and its closed-form solution but does not present any pseudocode or algorithm blocks.
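Since no pseudocode is given, only an illustrative reconstruction is possible. Below is a minimal NumPy sketch, assuming a common form for such objectives: the subspace basis is taken as the top-k eigenvectors of a λ-weighted difference of scatter matrices built from the labelled part-of-speech embeddings. The function names and the exact objective are assumptions, not the paper's stated Equation (4).

```python
import numpy as np

def pos_subspace(target_embs, other_embs, k=500, lam=0.5):
    """Illustrative closed-form subspace fit (an assumption, not the
    paper's exact Equation (4)): keep directions of high variance for
    the target part of speech while suppressing variance of the other
    parts of speech, traded off by lam."""
    # Scatter matrices of the labelled text embeddings, shape (n, d).
    S_target = target_embs.T @ target_embs / len(target_embs)
    S_other = other_embs.T @ other_embs / len(other_embs)
    # Closed form: top-k eigenvectors of the weighted difference.
    M = (1.0 - lam) * S_target - lam * S_other
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    W = eigvecs[:, -k:]             # (d, k) orthonormal basis
    return W

def project(x, W):
    """Project embeddings x of shape (n, d) onto the span of W."""
    return x @ W @ W.T
```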
Open Source Code: No. The paper does not explicitly state that the authors' source code for their methodology is released, nor does it provide a link to their own repository. Table 4 mentions a GitHub link for OpenCLIP, but this refers to a tool used, not the code for the paper's specific methodology.
Open Datasets: Yes. "Our labelled data points (with which we compute the closed-form solution of Equation (4)) for these parts of speech are given by the WordNet [16] database." "We find the noun submanifold projection to lead to improved zero-shot classification on 14/15 of the datasets considered with CLIP ViT-B-32 (ImageNet, MIT-States, UT Zap., DomainNet, Stanford Cars, Caltech101, Food101, CIFAR10, CIFAR100, Oxford Pets, Flowers102, Caltech256, STL10, MNIST, FER2013)."
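The WordNet supervision is reproducible with standard tooling. Below is a sketch of how one might collect the labelled data points, assuming NLTK's WordNet corpus and the public OpenAI CLIP package; the word sampling, the limit, and the lack of any prompt template are assumptions, not the paper's exact pipeline.

```python
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch
from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download('wordnet')

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def pos_embeddings(pos, limit=1000):
    """Embed WordNet lemmas of one part of speech (wn.NOUN, wn.ADJ,
    wn.VERB, or wn.ADV) with CLIP's text encoder."""
    words = sorted({l.name().replace("_", " ")
                    for s in wn.all_synsets(pos) for l in s.lemmas()})[:limit]
    tokens = clip.tokenize(words).to(device)  # batching omitted for brevity
    with torch.no_grad():
        embs = model.encode_text(tokens).float()
    return embs / embs.norm(dim=-1, keepdim=True)

noun_embs = pos_embeddings(wn.NOUN)
```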
Dataset Splits: No. The paper mentions using various datasets for zero-shot classification but does not provide specific training/validation/test splits or their percentages/counts. It refers to a "baseline zero-shot classification protocol", which implies standard splits, but these are not detailed.
Hardware Specification: Yes. "To run the Paella model, we use a 32GB NVIDIA Tesla V100 GPU."
Software Dependencies: No. The paper mentions software such as CLIP, the Paella text-to-image model, and OpenCLIP, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup: Yes. "For all experiments, we use the following 4 parts of speech: nouns, adjectives, verbs, and adverbs. We set λ := 1/2 for all experiments. For all quantitative results, we use the base CLIP ViT-B-32 model. For all results in this subsection, we project onto a relatively large k := 500 dimensional subspace for PCA, PGA, and the proposed method."
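The quoted setup (k := 500, λ := 1/2, CLIP ViT-B-32) is concrete enough to sketch where the projection enters standard zero-shot classification. A minimal sketch, assuming the basis W is a (d, k) torch tensor from a fit like the one sketched above and that a plain "a photo of a {c}" prompt is used; neither detail is confirmed by the quoted text.

```python
import clip
import torch

@torch.no_grad()
def zero_shot_logits(model, images, class_names, W, device="cuda"):
    """Standard CLIP zero-shot classification, with the class-name text
    embeddings projected onto the k := 500 subspace spanned by W before
    computing similarities."""
    prompts = [f"a photo of a {c}" for c in class_names]  # assumed template
    text = model.encode_text(clip.tokenize(prompts).to(device)).float()
    text = text @ W @ W.T                          # subspace projection
    text = text / text.norm(dim=-1, keepdim=True)
    image = model.encode_image(images.to(device)).float()
    image = image / image.norm(dim=-1, keepdim=True)
    return 100.0 * image @ text.T                  # cosine-similarity logits
```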