Im-Promptu: In-Context Composition from Image Prompts
Authors: Bhishma Dedhia, Michael Chang, Jake Snell, Tom Griffiths, Niraj Jha
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we introduce a suite of three benchmarks to test the generalization properties of a visual in-context learner. We formalize the notion of an analogy-based in-context learner and use it to design a meta-learning framework called Im-Promptu. Whereas the requisite token granularity for language is well established, the appropriate compositional granularity for enabling in-context generalization in visual stimuli is usually unspecified. To this end, we use Im-Promptu to train multiple agents with different levels of compositionality, including vector representations, patch representations, and object slots. Our experiments reveal tradeoffs between extrapolation abilities and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks. Patch-based representations require patches to contain entire objects for robust extrapolation. At the same time, object-centric tokenizers coupled with a cross-attention module generate consistent and high-fidelity solutions, with these inductive biases being particularly crucial for compositional generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive programming interface for image generation. |
| Researcher Affiliation | Academia | Bhishma Dedhia1, Michael Chang2, Jake C. Snell3, Thomas L. Griffiths3,4, Niraj K. Jha1 1Department of Electrical and Computer Engineering, Princeton University 2Department of Computer Science, University of California Berkeley 3Department of Computer Science, Princeton University 4Department of Psychology, Princeton University {bdedhia,js2523,tomg,jha}@princeton.edu, mbchang@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Im-Promptu Learning Algorithm |
| Open Source Code | No | We will release the complete dataset of primitive and composite tasks upon the publication of this article. |
| Open Datasets | Yes | We create a suite of three benchmarks (Fig. 2) from compositional image creators that include (a) 3D Shapes [25], (b) BitMoji Faces, and (c) CLEVr Objects [26]. |
| Dataset Splits | Yes | Each benchmark has a split of primitive training tasks and out-of-distribution test tasks... For example, the object color property in the 3D Shapes benchmark can take 10 unique values and 10 × 9 = 90 source-target combinations. Thus, an agent was trained on only 90 × 0.8 = 72 pairs for object-hue primitives. We made sure that each value in the domain set was shown at least once as either the target or the source. |
| Hardware Specification | No | The experiments reported in this article were performed on the computational resources managed and supported by Princeton Research Computing at Princeton University. |
| Software Dependencies | No | The paper mentions various software components and models like 'Slot Attention Transformer (SLATE) [7]', 'dVAE', 'Image-GPT [30]', and 'Vision Transformer [60]', but does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | The loss metric $\mathcal{L}$ used to train the agents was cross-entropy (CE) loss between the true latent sequence $Z_D$ and the predicted solution $\hat{Z}_D$ obtained from Image-GPT, i.e., $\mathcal{L}_{\text{impromptu}} = \mathrm{CE}(Z_D, \hat{Z}_D)$. In addition to the above loss, the dVAE was trained using the mean-squared error (MSE) loss over the raw pixel space to yield the full loss function $\mathcal{L} = \mathcal{L}_{\text{impromptu}} + \mathrm{MSE}(D, \hat{D})$. For inference, answers of the transformer-based agents were sampled from the Image-GPT decoder using top-k nucleus sampling [62]. Hyperparameters for training and inference have been laid out in Appendix E. |
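The Pseudocode and Experiment Setup rows describe the Im-Promptu objective only at a high level. Below is a minimal sketch of one training step under the stated loss, assuming a SLATE-style dVAE tokenizer and an Image-GPT-style autoregressive decoder; the module interfaces (`tokenizer.encode`, `tokenizer.decode`, `decoder`) and variable names are illustrative and are not taken from the authors' code.

```python
import torch.nn.functional as F

def impromptu_step(tokenizer, decoder, context_pairs, query, target):
    """One hypothetical gradient step on a primitive analogy task A:A' :: B:B'.

    context_pairs: list of (A, A') image tensors forming the prompt
    query:         image tensor B
    target:        ground-truth solution image B' of shape [batch, 3, H, W]
    """
    # Tokenize the prompt, query, and target into discrete latent sequences.
    prompt_tokens = [tokenizer.encode(img) for pair in context_pairs for img in pair]
    z_query = tokenizer.encode(query)        # latent tokens of the query image
    z_target = tokenizer.encode(target)      # Z_D, true latent sequence (token ids)

    # Autoregressively predict the target latent sequence from prompt + query.
    logits = decoder(prompt_tokens, z_query)  # [batch, seq_len, vocab]

    # L_impromptu: cross-entropy between predicted and true latent sequences.
    ce = F.cross_entropy(logits.flatten(0, 1), z_target.flatten())

    # dVAE reconstruction term: MSE between the target image and its decoding.
    recon = tokenizer.decode(z_target)
    mse = F.mse_loss(recon, target)

    # Full loss L = L_impromptu + MSE(D, D_hat), as quoted above.
    return ce + mse
```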
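The Dataset Splits row quotes a 90-combination / 72-pair figure for the object-hue primitives of 3D Shapes. The snippet below reconstructs that arithmetic under the assumption that ordered (source, target) pairs are split 80/20 at random; only the counts come from the quoted text, the sampling procedure itself is an assumption.

```python
import itertools
import random

hues = list(range(10))                                 # 10 unique hue values
pairs = list(itertools.permutations(hues, 2))          # ordered (source, target) pairs
assert len(pairs) == 10 * 9                            # 90 combinations

random.seed(0)
random.shuffle(pairs)
train = pairs[: int(0.8 * len(pairs))]                 # 90 * 0.8 = 72 training pairs
test = pairs[len(train):]                              # 18 held-out pairs

# Coverage property stated in the quote: every hue value appears in training
# at least once as either a source or a target (holds for this seed).
covered = {value for pair in train for value in pair}
assert covered == set(hues)
```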
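For inference, the Experiment Setup row states that answers are sampled from the Image-GPT decoder using top-k nucleus sampling [62]. The sketch below shows plain top-k filtering and sampling over the decoder logits; the value of k is a hyperparameter reported in the paper's Appendix E, so the default here is purely illustrative.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Sample one token id per row from the k highest-probability logits."""
    topk_vals, topk_idx = logits.topk(k, dim=-1)        # keep only the top-k logits
    probs = torch.softmax(topk_vals, dim=-1)            # renormalize over the top-k
    choice = torch.multinomial(probs, num_samples=1)    # sample within the top-k
    return topk_idx.gather(-1, choice).squeeze(-1)      # map back to vocabulary ids
```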