Learning to Compose Visual Relations
Authors: Nan Liu, Shuang Li, Yilun Du, Josh Tenenbaum, Antonio Torralba
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical studies to answer the following questions: (1) Can we learn relational models that can generate and edit complex multi-object scenes when given relational scene descriptions with multiple composed scene relations? (2) Can we use our model to generalize to scenes that are never seen in training? (3) Can we understand the set of relations in a scene and infer semantically equivalent descriptions? To answer these questions, we evaluate the proposed method and baselines on image generation, image editing, and image classification on two main datasets, i.e. CLEVR [19] and iGibson [41]. |
| Researcher Affiliation | Academia | Nan Liu (University of Michigan, liunan@umich.edu); Shuang Li (MIT CSAIL, lishuang@mit.edu); Yilun Du (MIT CSAIL, yilundu@mit.edu); Joshua B. Tenenbaum (MIT CSAIL, BCS, CBMM, jbt@mit.edu); Antonio Torralba (MIT CSAIL, torralba@mit.edu) |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a 'Project page at: https://composevisualrelations.github.io/' but does not explicitly state that source code for the described methodology is released or provide a direct link to a code repository. |
| Open Datasets | Yes | CLEVR. We use 50,000 pairs of images and relational scene descriptions for training. Each image contains 1–5 objects and each object consists of five different attributes, including color, shape, material, size, and its spatial relation to another object in the same image. There are 9 types of colors, 4 types of shapes, 3 types of materials, 3 types of sizes, and 6 types of relations. iGibson. On the iGibson dataset, we use 30,000 pairs of images and relational scene descriptions for training. Each image contains 1–3 objects and each object consists of the same five types of attributes as the CLEVR dataset. There are 6 types of colors, 5 types of shapes, 4 types of materials, 2 types of sizes, and 4 types of relations. The objects are randomly placed in the scenes. Blocks. On the real-world Blocks dataset, 3,000 pairs of images and relational scene descriptions are used for training. Each image contains 1–4 objects and each object differs in color. We only consider the above and below relations, as objects are placed vertically in the form of towers. |
| Dataset Splits | No | The paper describes train and test subsets (1R, 2R, 3R) but does not explicitly mention a separate validation split or how hyperparameters were tuned. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions adopting training code and models from [8] but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | No | The paper describes the training process with contrastive divergence and Langevin dynamics for sampling, but does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, or optimizer settings. |
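
The Experiment Setup row notes that the model is trained with contrastive divergence and sampled with Langevin dynamics, but no hyperparameters are reported. Below is a minimal sketch of what Langevin sampling from a sum of per-relation energy functions can look like; the `RelationEBM` module, step count, step size, and noise scale are illustrative assumptions, not values taken from the paper or its codebase.

```python
# A minimal sketch of Langevin-dynamics sampling from a composition of
# per-relation energy functions, in the spirit of the paper's description.
# RelationEBM, the step count, step size, and noise scale are illustrative
# assumptions, not the authors' architecture or hyperparameters.
import torch
import torch.nn as nn


class RelationEBM(nn.Module):
    """Toy energy network E(x, r): image features + relation embedding -> scalar energy."""

    def __init__(self, image_dim: int, relation_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + relation_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, rel], dim=-1)).squeeze(-1)


def langevin_sample(ebm: nn.Module,
                    relations: list,
                    x_init: torch.Tensor,
                    n_steps: int = 60,
                    step_size: float = 10.0,
                    noise_scale: float = 0.005) -> torch.Tensor:
    """Refine a sample by noisy gradient descent on the summed energy.

    Summing the energies of all composed relations pushes the sample toward
    images that satisfy every relation in the scene description at once.
    """
    x = x_init.clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        # Total energy of the composed scene description.
        energy = sum(ebm(x, rel) for rel in relations).sum()
        (grad,) = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x = x - step_size * grad + noise_scale * torch.randn_like(x)
    return x.detach()


if __name__ == "__main__":
    ebm = RelationEBM(image_dim=64, relation_dim=16)
    rels = [torch.randn(1, 16), torch.randn(1, 16)]  # two composed relations
    x0 = torch.randn(1, 64)                          # start from noise
    print(langevin_sample(ebm, rels, x0).shape)      # torch.Size([1, 64])
```

Composing relations by summing their energies before taking gradients is what lets one sampling loop satisfy several relational constraints at once, which is the paper's central compositional idea.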
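For the Open Datasets row, a rough picture of what one relational scene description contains (objects with color, shape, material, and size attributes, plus pairwise spatial relations) may help; the dataclass layout, field names, and example values below are assumptions for illustration only, not the datasets' actual serialization format.

```python
# A hypothetical in-memory layout for one relational scene description,
# matching the attribute inventory quoted above. Field names and example
# values are illustrative assumptions, not the datasets' real format.
from dataclasses import dataclass


@dataclass
class SceneObject:
    color: str     # e.g. one of the 9 CLEVR colors
    shape: str     # one of 4 shapes
    material: str  # one of 3 materials
    size: str      # one of 3 sizes


@dataclass
class Relation:
    subject: SceneObject
    predicate: str         # one of 6 spatial relations, e.g. "left of"
    target: SceneObject


# A description with two composed relations, i.e. a "2R" example.
scene_description = [
    Relation(
        SceneObject("red", "cube", "metal", "large"),
        "left of",
        SceneObject("blue", "sphere", "rubber", "small"),
    ),
    Relation(
        SceneObject("blue", "sphere", "rubber", "small"),
        "in front of",
        SceneObject("green", "cylinder", "metal", "large"),
    ),
]
```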