Learning Object Consistency and Interaction in Image Generation from Scene Graphs
Authors: Yangkang Zhang, Chenye Meng, Zejian Li, Pei Chen, Guang Yang, Changyuan Yang, Lingyun Sun
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on COCO-stuff and Visual Genome datasets show our proposed method alleviates the ignorance of objects and outperforms the state-of-the-art on visual fidelity of generated images and objects. |
| Researcher Affiliation | Collaboration | Yangkang Zhang¹, Chenye Meng², Zejian Li², Pei Chen¹, Guang Yang³, Changyuan Yang³ and Lingyun Sun¹ — ¹College of Computer Science and Technology, Zhejiang University, China; ²School of Software Technology, Zhejiang University, China; ³Alibaba Group |
| Pseudocode | No | The paper describes the methods in prose and with diagrams, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code, dataset and supplementary file are available at https://github.com/yangkzz/LOCI. |
| Open Datasets | Yes | We validate the proposed LOCI on the COCO-stuff and Visual Genome dataset using the same split of datasets as previous works [Johnson et al., 2018; Zhao et al., 2022]. |
| Dataset Splits | Yes | We validate the proposed LOCI on the COCO-stuff and Visual Genome dataset using the same split of datasets as previous works [Johnson et al., 2018; Zhao et al., 2022]. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments. |
| Software Dependencies | No | The paper cites the models and frameworks it builds on (e.g., YOLOv7, VQGAN, transformer, GAT), but it does not specify version numbers for the software libraries used in the implementation. |
| Experiment Setup | Yes | The second phase trains the regression of bounding boxes B and a GNN model with the consistency module. For a sampled object with a ground-truth bounding box b and an embedding v given by the GNN, the training minimizes $\mathcal{L}_2 = \mathcal{L}_{bbox} + \mathcal{L}_{con}$, where $\mathcal{L}_{bbox} = \mathbb{E}_{b,v}\,\lVert b - B(v) \rVert_2$ (Eq. 6). The third phase trains the mapping from object embeddings to image latent codes with our consistency loss and the interaction module. The training loss is $\mathcal{L}_3 = \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{ce}$, where $\mathcal{L}_{ce} = \mathbb{E}_{s_i}\!\left[-\log P\!\left(s_i \mid \hat{s}_{N_i, <i}, V\right)\right]$ (Eq. 7). $\mathcal{L}_{ce}$ is a cross-entropy loss; $\lambda_1 = 0.6$ and $\lambda_2 = 0.4$. For an unseen scene graph during sampling, the trained GNN in the second phase gives new object embeddings, and the mapping in the third phase infers new latent codes autoregressively. Accordingly, the decoder in the first phase generates new images. We leverage the multinomial resampling strategy [Jahn et al., 2021] to improve generative diversity. |
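As a rough illustration of the loss weighting quoted above, the PyTorch-style sketch below combines a bounding-box regression loss with a consistency loss (phase 2), and a weighted consistency plus autoregressive cross-entropy loss with λ1 = 0.6 and λ2 = 0.4 (phase 3). The function names, tensor shapes, and the stand-in form of the consistency loss are assumptions for illustration only; this is not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the paper's consistency loss L_con: here it simply
# penalizes the distance between two object-embedding views. The actual L_con
# in LOCI is defined by the consistency module and is not reproduced here.
def consistency_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(emb_a, emb_b)

def phase2_loss(pred_bbox, gt_bbox, emb_a, emb_b):
    """Phase 2: L2 = L_bbox + L_con, with L_bbox = E_{b,v} ||b - B(v)||_2."""
    l_bbox = torch.norm(gt_bbox - pred_bbox, p=2, dim=-1).mean()
    return l_bbox + consistency_loss(emb_a, emb_b)

def phase3_loss(logits, target_codes, emb_a, emb_b, lam1=0.6, lam2=0.4):
    """Phase 3: L3 = λ1 * L_con + λ2 * L_ce, where L_ce is the cross-entropy
    of the autoregressively predicted latent codes against the targets."""
    l_ce = F.cross_entropy(logits.flatten(0, 1), target_codes.flatten())
    return lam1 * consistency_loss(emb_a, emb_b) + lam2 * l_ce
```

In a training loop following the paper's phasing, `phase2_loss` would be minimized while optimizing the GNN and the box regressor B, and `phase3_loss` while optimizing the mapping from object embeddings to image latent codes.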