Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image

Authors: Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances." |
| Researcher Affiliation | Collaboration | Yu Zhao (Tianjin University); Hao Fei (National University of Singapore); Xiangtai Li (ByteDance); Libo Qin (Central South University); Jiayi Ji (National University of Singapore); Hongyuan Zhu (I2R & CFAR, A*STAR); Meishan Zhang (Harbin Institute of Technology, Shenzhen); Min Zhang (Harbin Institute of Technology, Shenzhen); Jianguo Wei (Tianjin University) |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found. |
| Open Source Code | Yes | "We will open source at Github." |
| Open Datasets | Yes | "To demonstrate the capability of our proposed method for both ST2I and SI2T generation, we conduct experiments on the VSD [95, 97] dataset, which is constructed for visual spatial understanding. ... We follow [72] to take the 3D datasets Matterport3D (MP3D) [7], 3DSSG [78], and CURB-SG [27] ..." |
| Dataset Splits | No | The paper mentions using the VSD dataset for experiments and training on "aligned 3DSG-Image-Text data," but does not specify the exact train/validation/test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper reports training time (e.g., "20 hours" for overall training) and model parameters in Table 8, but does not specify the hardware used (e.g., GPU/CPU models or memory amounts). |
| Software Dependencies | No | "We use the pre-trained VQ-VAE of VQ-GAN [14]... For the text encoder, we adopt the CLIP model... We adopt the pre-trained GPT-2 [59]... We optimize the framework using AdamW [50]... We follow the default settings of DGAE [54]..." While the paper lists software components, it generally lacks explicit version numbers. The checklist mentions Python 3.8, but versions of the key libraries are not given. |
| Experiment Setup | Yes | "We optimize the framework using AdamW [50] with β1 = 0.9 and β2 = 0.98. The learning rate is set to 5e-5 after 10,000 warmup iterations in the final dual tuning." |
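The reported optimization setup (AdamW with β1 = 0.9, β2 = 0.98, and a learning rate of 5e-5 reached after 10,000 warmup iterations) can be sketched as a warmup schedule. Note the linear ramp and the constant rate after warmup are assumptions for illustration; the paper states only the post-warmup learning rate and the warmup length:

```python
def lr_at(step: int, base_lr: float = 5e-5, warmup_steps: int = 10_000) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup to base_lr over warmup_steps, then constant.
    The linear shape and the constant post-warmup behavior are
    assumptions; the paper only states lr = 5e-5 after 10,000
    warmup iterations.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# AdamW hyperparameters as reported: beta1 = 0.9, beta2 = 0.98.
# With PyTorch this would look like, e.g.:
#   torch.optim.AdamW(model.parameters(), lr=lr_at(0), betas=(0.9, 0.98))

print(f"{lr_at(0):.1e}")       # first step, deep in warmup
print(f"{lr_at(20_000):.1e}")  # after warmup: 5.0e-05
```

Such a schedule is typically wrapped in a framework scheduler (e.g., `torch.optim.lr_scheduler.LambdaLR`) rather than applied by hand.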