Learning to Compose Visual Relations
Authors: Nan Liu, Shuang Li, Yilun Du, Josh Tenenbaum, Antonio Torralba
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical studies to answer the following questions: (1) Can we learn relational models that can generate and edit complex multi-object scenes when given relational scene descriptions with multiple composed scene relations? (2) Can we use our model to generalize to scenes that are never seen in training? (3) Can we understand the set of relations in a scene and infer semantically equivalent descriptions? To answer these questions, we evaluate the proposed method and baselines on image generation, image editing, and image classification on two main datasets, i.e. CLEVR [19] and iGibson [41]. |
| Researcher Affiliation | Academia | Nan Liu (University of Michigan, liunan@umich.edu); Shuang Li (MIT CSAIL, lishuang@mit.edu); Yilun Du (MIT CSAIL, yilundu@mit.edu); Joshua B. Tenenbaum (MIT CSAIL, BCS, CBMM, jbt@mit.edu); Antonio Torralba (MIT CSAIL, torralba@mit.edu) |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a 'Project page at: https://composevisualrelations.github.io/' but does not explicitly state that source code for the described methodology is released or provide a direct link to a code repository. |
| Open Datasets | Yes | CLEVR. We use 50,000 pairs of images and relational scene descriptions for training. Each image contains 1–5 objects and each object consists of five different attributes, including color, shape, material, size, and its spatial relation to another object in the same image. There are 9 types of colors, 4 types of shapes, 3 types of materials, 3 types of sizes, and 6 types of relations. iGibson. On the iGibson dataset, we use 30,000 pairs of images and relational scene descriptions for training. Each image contains 1–3 objects and each object consists of the same five types of attributes as the CLEVR dataset. There are 6 types of colors, 5 types of shapes, 4 types of materials, 2 types of sizes, and 4 types of relations. The objects are randomly placed in the scenes. Blocks. On the real-world Blocks dataset, 3,000 pairs of images and relational scene descriptions are used for training. Each image contains 1–4 objects and each object differs in color. We only consider the above and below relations, as objects are placed vertically in the form of towers. |
| Dataset Splits | No | The paper describes train and test subsets (1R, 2R, 3R) but does not explicitly mention a separate validation split or how hyperparameters were tuned. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions adopting training code and models from [8] but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | No | The paper describes the training process with contrastive divergence and Langevin dynamics for sampling, but does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, or optimizer settings. |
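
The Experiment Setup row notes that the model is trained with contrastive divergence and sampled with Langevin dynamics, but no hyperparameters are reported. Below is a minimal sketch of what Langevin sampling from a sum of per-relation energy functions can look like; the `RelationEBM` module, step count, step size, and noise scale are illustrative assumptions, not values taken from the paper or its codebase.

```python
# A minimal sketch of Langevin-dynamics sampling from a composition of
# per-relation energy functions, in the spirit of the paper's description.
# RelationEBM, the step count, step size, and noise scale are illustrative
# assumptions, not the authors' architecture or hyperparameters.
import torch
import torch.nn as nn


class RelationEBM(nn.Module):
    """Toy energy network E(x, r): image features + relation embedding -> scalar energy."""

    def __init__(self, image_dim: int, relation_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + relation_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, rel], dim=-1)).squeeze(-1)


def langevin_sample(ebm: nn.Module,
                    relations: list,
                    x_init: torch.Tensor,
                    n_steps: int = 60,
                    step_size: float = 10.0,
                    noise_scale: float = 0.005) -> torch.Tensor:
    """Refine a sample by noisy gradient descent on the summed energy.

    Summing the energies of all composed relations pushes the sample toward
    images that satisfy every relation in the scene description at once.
    """
    x = x_init.clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        # Total energy of the composed scene description.
        energy = sum(ebm(x, rel) for rel in relations).sum()
        (grad,) = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x = x - step_size * grad + noise_scale * torch.randn_like(x)
    return x.detach()


if __name__ == "__main__":
    ebm = RelationEBM(image_dim=64, relation_dim=16)
    rels = [torch.randn(1, 16), torch.randn(1, 16)]  # two composed relations
    x0 = torch.randn(1, 64)                          # start from noise
    print(langevin_sample(ebm, rels, x0).shape)      # torch.Size([1, 64])
```

Composing relations by summing their energies before taking gradients is what lets one sampling loop satisfy several relational constraints at once, which is the paper's central compositional idea.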
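For the Open Datasets row, a rough picture of what one relational scene description contains (objects with color, shape, material, and size attributes, plus pairwise spatial relations) may help; the dataclass layout, field names, and example values below are assumptions for illustration only, not the datasets' actual serialization format.

```python
# A hypothetical in-memory layout for one relational scene description,
# matching the attribute inventory quoted above. Field names and example
# values are illustrative assumptions, not the datasets' real format.
from dataclasses import dataclass


@dataclass
class SceneObject:
    color: str     # e.g. one of the 9 CLEVR colors
    shape: str     # one of 4 shapes
    material: str  # one of 3 materials
    size: str      # one of 3 sizes


@dataclass
class Relation:
    subject: SceneObject
    predicate: str         # one of 6 spatial relations, e.g. "left of"
    target: SceneObject


# A description with two composed relations, i.e. a "2R" example.
scene_description = [
    Relation(
        SceneObject("red", "cube", "metal", "large"),
        "left of",
        SceneObject("blue", "sphere", "rubber", "small"),
    ),
    Relation(
        SceneObject("blue", "sphere", "rubber", "small"),
        "in front of",
        SceneObject("green", "cylinder", "metal", "large"),
    ),
]
```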