ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation

Authors: Zhengzhe Liu, Peng Dai, Ruihui Li, Xiaojuan Qi, Chi-Wing Fu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results manifest that our approach outperforms the state-of-the-arts and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures. Codes are available at https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation.
Researcher Affiliation | Academia | Zhengzhe Liu^1, Peng Dai^2, Ruihui Li^3, Xiaojuan Qi^2, Chi-Wing Fu^1; ^1 The Chinese University of Hong Kong, ^2 The University of Hong Kong, ^3 Hunan University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation.
Open Datasets | Yes | With multi-view RGB or RGBD images and camera poses, we can train ISS on the synthetic dataset ShapeNet (Chang et al., 2015) (13 categories) and the real-world dataset CO3D (Reizenstein et al., 2021) (50 categories).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts, nor does it mention a validation set.
Hardware Specification | Yes | taking only around 85 seconds on a single GeForce RTX 3090 Ti
Software Dependencies | No | The paper mentions various software components and models (e.g., CLIP, DVR) but does not provide specific version numbers for any of the libraries or frameworks used.
Experiment Setup | Yes | As shown in Figure 2 (b), given input text T, we replace image encoder E_I with text encoder E_T to extract CLIP text feature f_T, then fine-tune M with a CLIP consistency loss between the input text T and m images {R_i}_{i=1}^{m} rendered with random camera poses from the output shape S; see Equation 3: ... In stage-2 alignment, we still adopt L_bg to enhance the model's foreground awareness. Comparing Figures 4 (a) and (b), we can see that the stage-2 alignment is able to find a rough shape with L_bg in around five iterations, yet fails to produce a reasonable output without L_bg, since having the same color prediction on both foreground and background hinders the object awareness of the model. Thanks to the joint text-image embedding of CLIP, the gap between text feature f_T and shape feature f_S has already been largely narrowed by M. Therefore, the stage-2 alignment only needs to fine-tune M for 20 iterations using the input text, taking only around 85 seconds on a single GeForce RTX 3090 Ti, compared with 72 minutes taken by Dream Fields (Jain et al., 2022) at test time.
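
To make the quoted stage-2 alignment concrete, the following is a minimal PyTorch-style sketch of the described loop: a frozen CLIP text encoder provides f_T, and only the mapper M is updated for around 20 iterations so that renderings of the output shape from random camera poses agree with the text in CLIP embedding space, with an extra background term standing in for L_bg. The helpers mapper, decoder, render, and sample_random_pose, the white-background form of the loss, and all hyperparameters are assumptions for illustration, not the paper's implementation; only the openai/CLIP calls follow a real API, and CLIP's input normalization is omitted for brevity.

    import torch
    import torch.nn.functional as F
    import clip  # openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)
    clip_model.eval()

    def stage2_alignment(mapper, decoder, render, sample_random_pose, text,
                         m=4, iters=20, lr=1e-4, lambda_bg=1.0):
        """Fine-tune the mapper M so renderings of the generated shape match
        the input text in CLIP space (hypothetical helper signatures)."""
        for p in clip_model.parameters():        # CLIP stays frozen
            p.requires_grad_(False)

        tokens = clip.tokenize([text]).to(device)
        f_T = F.normalize(clip_model.encode_text(tokens), dim=-1)  # text feature f_T

        opt = torch.optim.Adam(mapper.parameters(), lr=lr)         # only M is updated
        for _ in range(iters):                                     # ~20 iterations per the paper
            f_S = mapper(f_T)                                      # map f_T toward the shape space
            shape = decoder(f_S)                                   # frozen decoder -> 3D shape S

            loss_clip, loss_bg = 0.0, 0.0
            for _ in range(m):                                     # m renderings, random camera poses
                pose = sample_random_pose()
                rgb, alpha = render(shape, pose)                   # image R_i and foreground mask
                rgb_224 = F.interpolate(rgb, size=(224, 224),
                                        mode="bilinear", align_corners=False)
                f_R = F.normalize(clip_model.encode_image(rgb_224), dim=-1)
                loss_clip = loss_clip + (1.0 - (f_R * f_T).sum(dim=-1).mean())
                # Background term standing in for L_bg: push background pixels toward a
                # constant (white) color so the model stays foreground-aware. The exact
                # form of L_bg in the paper may differ; this is an assumption.
                loss_bg = loss_bg + (((1.0 - alpha) * (rgb - 1.0)) ** 2).mean()

            loss = (loss_clip + lambda_bg * loss_bg) / m
            opt.zero_grad()
            loss.backward()
            opt.step()
        return shape

This sketch reflects why the reported test-time cost is small: CLIP and the shape decoder are frozen, so only the lightweight mapper M is optimized for a handful of iterations per text prompt.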