Im-Promptu: In-Context Composition from Image Prompts

Authors: Bhishma Dedhia, Michael Chang, Jake Snell, Tom Griffiths, Niraj Jha

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we introduce a suite of three benchmarks to test the generalization properties of a visual in-context learner. We formalize the notion of an analogy-based in-context learner and use it to design a meta-learning framework called Im-Promptu. Whereas the requisite token granularity for language is well established, the appropriate compositional granularity for enabling in-context generalization in visual stimuli is usually unspecified. To this end, we use Im-Promptu to train multiple agents with different levels of compositionality, including vector representations, patch representations, and object slots. Our experiments reveal tradeoffs between extrapolation abilities and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks. Patch-based representations require patches to contain entire objects for robust extrapolation. At the same time, object-centric tokenizers coupled with a cross-attention module generate consistent and high-fidelity solutions, with these inductive biases being particularly crucial for compositional generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive programming interface for image generation. (An illustrative analogy-composition sketch follows the table.)
Researcher Affiliation | Academia | Bhishma Dedhia¹, Michael Chang², Jake C. Snell³, Thomas L. Griffiths³,⁴, Niraj K. Jha¹; ¹Department of Electrical and Computer Engineering, Princeton University; ²Department of Computer Science, University of California, Berkeley; ³Department of Computer Science, Princeton University; ⁴Department of Psychology, Princeton University; {bdedhia,js2523,tomg,jha}@princeton.edu, mbchang@berkeley.edu
Pseudocode | Yes | Algorithm 1: Im-Promptu Learning Algorithm. (A toy training-loop sketch follows the table.)
Open Source Code | No | We will release the complete dataset of primitive and composite tasks upon the publication of this article.
Open Datasets | Yes | We create a suite of three benchmarks (Fig. 2) from compositional image creators that include (a) 3D Shapes [25], (b) BitMoji Faces, and (c) CLEVr Objects [26].
Dataset Splits | Yes | Each benchmark has a split of primitive training tasks and out-of-distribution test tasks... For example, the object color property in the 3D Shapes benchmark can take 10 unique values and 10 × 9 = 90 source-target combinations. Thus, an agent was trained on only 90 × 0.8 = 72 pairs for object-hue primitives. We made sure that each value in the domain set was shown at least once as either the target or the source. (A split-arithmetic sketch follows the table.)
Hardware Specification | No | The experiments reported in this article were performed on the computational resources managed and supported by Princeton Research Computing at Princeton University.
Software Dependencies | No | The paper mentions various software components and models like 'Slot Attention Transformer (SLATE) [7]', 'dVAE', 'Image-GPT [30]', and 'Vision Transformer [60]', but does not specify version numbers for these software dependencies.
Experiment Setup | Yes | The loss metric $\mathcal{L}$ used to train the agents was the cross-entropy (CE) loss between the true latent sequence $Z_D$ and the predicted solution $\hat{Z}_D$ obtained from Image-GPT, i.e., $\mathcal{L}_{\text{Im-Promptu}} = \mathrm{CE}(Z_D, \hat{Z}_D)$. In addition to the above loss, the dVAE was trained using the mean-squared error (MSE) loss over the raw pixel space to yield the full loss function $\mathcal{L} = \mathcal{L}_{\text{Im-Promptu}} + \mathrm{MSE}(D, \hat{D})$. For inference, answers of the transformer-based agents were sampled from the Image-GPT decoder using top-k nucleus sampling [62]. Hyperparameters for training and inference have been laid out in Appendix E. (A loss-and-sampling sketch follows the table.)
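
To make the "Research Type" row's description of analogy-based in-context composition concrete, here is a minimal PyTorch sketch of a cross-attention step that maps a context pair (A, A') and a query B to latent tokens for the predicted answer. The module name, token/slot shapes, and single-layer design are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: one cross-attention step for analogy-based
# composition, A : A' :: B : ?. Shapes, names, and the single-layer design are
# assumptions made for this example, not the paper's exact architecture.
import torch
import torch.nn as nn

class CrossAttentionComposer(nn.Module):
    """Attends from the query image's tokens over the context pair's tokens."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, ctx_a, ctx_a_prime, query_b):
        # Each input: (batch, num_tokens, dim), e.g., object slots or patches.
        context = torch.cat([ctx_a, ctx_a_prime], dim=1)
        attended, _ = self.attn(query_b, context, context)
        return self.proj(attended)  # latent tokens of the predicted answer B'

composer = CrossAttentionComposer()
a, a_prime, b = (torch.randn(1, 7, 128) for _ in range(3))
print(composer(a, a_prime, b).shape)  # torch.Size([1, 7, 128])
```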
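
The "Pseudocode" row points to Algorithm 1, the Im-Promptu learning algorithm. Below is a hedged, toy version of such an analogy-based training loop; the ToyAnalogyAgent, the synthetic sample_pair task, and the pixel-space MSE objective are stand-ins chosen for readability, whereas the paper's agents predict discrete latent tokens and are trained with cross-entropy.

```python
# Toy stand-in for the meta-training loop: a context pair (A, A') demonstrates
# a transformation and the agent applies it to query B. The agent, task, and
# MSE objective here are simplifications of the paper's setup (see the loss
# sketch below for the token-level objective the paper actually quotes).
import torch
import torch.nn as nn

class ToyAnalogyAgent(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * dim, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, a, a_prime, b):
        return self.net(torch.cat([a, a_prime, b], dim=-1))

def sample_pair(dim: int = 64):
    """Hypothetical primitive task: the 'transformation' adds a fixed offset."""
    a = torch.randn(1, dim)
    return a, a + 1.0

agent = ToyAnalogyAgent()
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-3)
for step in range(200):
    (a, a_prime), (b, b_prime) = sample_pair(), sample_pair()
    loss = nn.functional.mse_loss(agent(a, a_prime, b), b_prime)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```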
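
The arithmetic quoted in the "Dataset Splits" row (10 hue values giving 10 × 9 = 90 ordered source-target pairs, 80% of them seen in training) can be reproduced with a short sketch. The helper name and seeded shuffle are assumptions; the 80/20 ratio and the coverage constraint come from the quoted text.

```python
# Reproduces the quoted split arithmetic: 10 hues -> 10 * 9 = 90 ordered
# source-target pairs, 80% (72) for training. The coverage check mirrors the
# quoted constraint that every value appears at least once as source or target.
import itertools
import random

def split_primitive_pairs(values, train_frac=0.8, seed=0):
    pairs = list(itertools.permutations(values, 2))
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    train, test = pairs[:n_train], pairs[n_train:]
    covered = {v for pair in train for v in pair}
    assert covered == set(values), "a value never appears in any training pair"
    return train, test

train, test = split_primitive_pairs(list(range(10)))
print(len(train), len(test))  # 72 18
```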
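
Finally, the objective and decoding procedure quoted in the "Experiment Setup" row suggest the following hedged sketch: a cross-entropy term over the predicted latent token sequence plus a dVAE pixel-space MSE, and top-k sampling from the decoder logits. Tensor shapes, vocabulary size, and the value of k are assumed for illustration.

```python
# Hedged sketch of the quoted objective and decoding: CE over predicted latent
# tokens plus a dVAE pixel-space MSE, and top-k sampling from the decoder.
# Shapes, vocabulary size, and k are assumed values, not the paper's settings.
import torch
import torch.nn.functional as F

def im_promptu_loss(token_logits, true_tokens, dvae_recon, true_images):
    # token_logits: (B, T, V) logits over discrete codes; true_tokens: (B, T)
    ce = F.cross_entropy(token_logits.flatten(0, 1), true_tokens.flatten())
    mse = F.mse_loss(dvae_recon, true_images)
    return ce + mse  # L = L_Im-Promptu + MSE(D, D_hat)

def sample_top_k(logits, k=50, temperature=1.0):
    # Keep the k most likely codes, renormalize, and sample one per position.
    topk_vals, topk_idx = torch.topk(logits / temperature, k, dim=-1)
    choice = torch.multinomial(F.softmax(topk_vals, dim=-1), num_samples=1)
    return topk_idx.gather(-1, choice).squeeze(-1)

# Toy shapes: batch of 4, a 16x16 token grid (256 tokens), 512-code vocabulary.
logits = torch.randn(4, 256, 512)
tokens = torch.randint(0, 512, (4, 256))
recon, imgs = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
print(im_promptu_loss(logits, tokens, recon, imgs))
print(sample_top_k(logits[:, -1, :]).shape)  # one sampled code per image
```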