Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image
Authors: Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances. |
| Researcher Affiliation | Collaboration | Yu Zhao (Tianjin University), Hao Fei (National University of Singapore), Xiangtai Li (Bytedance), Libo Qin (Central South University), Jiayi Ji (National University of Singapore), Hongyuan Zhu (I2R & CFAR, A*STAR), Meishan Zhang (Harbin Institute of Technology, Shenzhen), Min Zhang (Harbin Institute of Technology, Shenzhen), Jianguo Wei (Tianjin University) |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found. |
| Open Source Code | Yes | We will open source at Github. |
| Open Datasets | Yes | To demonstrate the capability of our proposed method for both ST2I and SI2T generation, we conduct experiments on the VSD [95, 97] dataset, which is constructed for visual spatial understanding. ... We follow [72] to take the 3D datasets Matterport3D (MP3D) [7], 3DSSG [78], and CURB-SG [27]... |
| Dataset Splits | No | The paper mentions using the VSD dataset for experiments and training on 'aligned 3DSG-Image-Text data' but does not specify exact train/validation/test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper provides training time (e.g., '20 hours' for 'Overall Training') and model parameters in Table 8, but does not specify the exact hardware used (e.g., GPU/CPU models, memory amounts). |
| Software Dependencies | No | We use the pre-trained VQ-VAE of VQ-GAN [14]... For the text encoder, we adopt the CLIP model... We adopt the pre-trained GPT-2 [59]... We optimize the framework using AdamW [50]... We follow the default settings of DGAE [54]... While the paper names these components, it generally lacks explicit version numbers; the checklist mentions 'Python 3.8', but versions of the key libraries are not detailed (see the dependency-loading sketch after the table). |
| Experiment Setup | Yes | We optimize the framework using AdamW [50] with β1 = 0.9 and β2 = 0.98. The learning rate is set to 5e-5 after 10,000 warmup iterations in the final dual tuning (see the optimizer/warmup sketch after the table). |
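
The paper names its pre-trained components (the VQ-GAN VQ-VAE, a CLIP text encoder, GPT-2) but pins no versions, which is the gap flagged in the Software Dependencies row. The sketch below shows one way a re-implementer might instantiate the text-side components with Hugging Face `transformers`; the checkpoint identifiers (`openai/clip-vit-base-patch32`, `gpt2`) are assumptions, not values given in the paper.

```python
# Hypothetical environment reconstruction for the unversioned dependencies.
# Checkpoint names are assumptions; the paper only names the model families.
from transformers import CLIPTokenizer, CLIPTextModel, GPT2Tokenizer, GPT2LMHeadModel

# CLIP text encoder (the paper says only "the CLIP model"; variant assumed).
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# Pre-trained GPT-2 decoder for text generation (size variant not specified).
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_decoder = GPT2LMHeadModel.from_pretrained("gpt2")

# The VQ-GAN VQ-VAE [14] would come from the taming-transformers codebase;
# it is omitted here because no checkpoint or version is reported.
```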
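The optimizer details quoted in the Experiment Setup row are concrete enough to sketch. The following is a minimal PyTorch sketch assuming a linear warmup to the peak learning rate of 5e-5 over 10,000 iterations and a constant rate afterwards; the post-warmup schedule, weight decay, total step count, and the placeholder model are all assumptions not stated in the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder module standing in for the actual dual ST2I/SI2T model.
model = torch.nn.Linear(512, 512)

# AdamW with the betas reported in the paper; weight decay is not reported,
# so the PyTorch default is kept.
optimizer = AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.98))

# Linear warmup to the peak learning rate over 10,000 iterations,
# then held constant (the post-warmup schedule is not described).
warmup_steps = 10_000
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(20_000):   # total step count is illustrative, not from the paper
    optimizer.step()          # gradients would come from the dual-task loss
    scheduler.step()
    optimizer.zero_grad()
```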