Learning Object Consistency and Interaction in Image Generation from Scene Graphs

Authors: Yangkang Zhang, Chenye Meng, Zejian Li, Pei Chen, Guang Yang, Changyuan Yang, Lingyun Sun

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on COCO-stuff and Visual Genome datasets show our proposed method alleviates the ignorance of objects and outperforms the state-of-the-art on visual fidelity of generated images and objects.
Researcher Affiliation | Collaboration | Yangkang Zhang (1), Chenye Meng (2), Zejian Li (2), Pei Chen (1), Guang Yang (3), Changyuan Yang (3) and Lingyun Sun (1); (1) College of Computer Science and Technology, Zhejiang University, China; (2) School of Software Technology, Zhejiang University, China; (3) Alibaba Group
Pseudocode | No | The paper describes the methods in prose and with diagrams, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Source code, dataset and supplementary file are available at https://github.com/yangkzz/LOCI.
Open Datasets | Yes | We validate the proposed LOCI on the COCO-stuff and Visual Genome dataset using the same split of datasets as previous works [Johnson et al., 2018; Zhao et al., 2022].
Dataset Splits | Yes | We validate the proposed LOCI on the COCO-stuff and Visual Genome dataset using the same split of datasets as previous works [Johnson et al., 2018; Zhao et al., 2022].
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments.
Software Dependencies | No | The paper mentions various models and frameworks (e.g., YOLOv7, VQGAN, transformer, GAT) by citation, but it does not specify version numbers for general ancillary software or libraries used for implementation.
Experiment Setup | Yes | The second phase trains the regression of bounding boxes $B$ and a GNN model with the consistency module. For a sampled object with a GT bounding box $b$ and an embedding $v$ given by the GNN, the training minimizes $\mathcal{L}_2 = \mathcal{L}_{bbox} + \mathcal{L}_{con}$, where $\mathcal{L}_{bbox} = \mathbb{E}_{b,v}\left[\lVert b - B(v) \rVert_2\right]$ (6). The third phase trains the mapping from object embeddings to image latent codes with our consistency loss and the interaction module. The training loss is $\mathcal{L}_3 = \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{ce}$, where $\mathcal{L}_{ce} = \mathbb{E}_{s_i}\left[-\log P\left(s_i \mid \hat{s}^{<i}_{N_i}, V\right)\right]$ (7). $\mathcal{L}_{ce}$ is a cross-entropy loss; $\lambda_1 = 0.6$ and $\lambda_2 = 0.4$. For an unseen scene graph during sampling, the trained GNN in the second phase gives new object embeddings, and the mapping in the third phase infers new latent codes autoregressively. Accordingly, the decoder in the first phase generates new images. We leverage the multinomial resampling strategy [Jahn et al., 2021] to improve generative diversity.
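
The setup above is described only in prose, so a minimal PyTorch-style sketch of the phase-2 and phase-3 objectives and of the autoregressive sampling step is given below. The module names (bbox_head, consistency_module, code_predictor), tensor shapes, and the exact form of the consistency term are assumptions for illustration, not the authors' released implementation; only the loss structure (L2 = Lbbox + Lcon, L3 = λ1·Lcon + λ2·Lce with λ1 = 0.6, λ2 = 0.4) and the multinomial sampling strategy follow the quoted text.

```python
# Illustrative sketch only: bbox_head, consistency_module and code_predictor
# are hypothetical callables standing in for the paper's modules.
import torch
import torch.nn.functional as F

LAMBDA_1, LAMBDA_2 = 0.6, 0.4  # phase-3 weights reported in the paper


def phase2_loss(bbox_head, consistency_module, obj_embeddings, gt_boxes):
    """L2 = L_bbox + L_con: box regression plus the consistency term."""
    pred_boxes = bbox_head(obj_embeddings)        # B(v)
    l_bbox = F.mse_loss(pred_boxes, gt_boxes)     # E_{b,v} ||b - B(v)||
    l_con = consistency_module(obj_embeddings)    # consistency loss (form assumed)
    return l_bbox + l_con


def phase3_loss(code_predictor, consistency_module, obj_embeddings, gt_codes):
    """L3 = lambda1 * L_con + lambda2 * L_ce (teacher-forced cross-entropy)."""
    # code_predictor returns logits over the VQ codebook for every position,
    # conditioned on previous codes and object embeddings V (shift omitted here).
    logits = code_predictor(gt_codes, obj_embeddings)         # (B, T, vocab)
    l_ce = F.cross_entropy(logits.flatten(0, 1), gt_codes.flatten())
    l_con = consistency_module(obj_embeddings)
    return LAMBDA_1 * l_con + LAMBDA_2 * l_ce


@torch.no_grad()
def sample_codes(code_predictor, obj_embeddings, seq_len, bos_token=0):
    """Autoregressive decoding with multinomial sampling instead of argmax."""
    batch = obj_embeddings.size(0)
    codes = torch.full((batch, 1), bos_token, dtype=torch.long,
                       device=obj_embeddings.device)
    for _ in range(seq_len):
        logits = code_predictor(codes, obj_embeddings)[:, -1]  # last position
        probs = logits.softmax(dim=-1)
        next_code = torch.multinomial(probs, num_samples=1)    # stochastic pick
        codes = torch.cat([codes, next_code], dim=1)
    return codes[:, 1:]
```

At sampling time, the phase-2 GNN would supply obj_embeddings for an unseen scene graph, and the phase-1 decoder (a VQGAN in the paper) would map the sampled latent codes back to an image.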