ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation

Authors: Zhengzhe Liu, Peng Dai, Ruihui Li, Xiaojuan Qi, Chi-Wing Fu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results manifest that our approach outperforms the state-of-the-arts and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures. Codes are available at https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation.
Researcher Affiliation | Academia | Zhengzhe Liu^1, Peng Dai^2, Ruihui Li^3, Xiaojuan Qi^2, Chi-Wing Fu^1; ^1 The Chinese University of Hong Kong, ^2 The University of Hong Kong, ^3 Hunan University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation.
Open Datasets | Yes | With multi-view RGB or RGBD images and camera poses, we can train ISS on the synthetic dataset ShapeNet (Chang et al., 2015) (13 categories) and the real-world dataset CO3D (Reizenstein et al., 2021) (50 categories).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts, nor does it mention a validation set.
Hardware Specification | Yes | taking only around 85 seconds on a single GeForce RTX 3090 Ti
Software Dependencies | No | The paper mentions various software components and models (e.g., CLIP, DVR) but does not provide specific version numbers for any of the libraries or frameworks used.
Experiment Setup | Yes | As shown in Figure 2 (b), given input text T, we replace image encoder E_I with text encoder E_T to extract CLIP text feature f_T, then fine-tune M with a CLIP consistency loss between the input text T and m images {R_i}_{i=1}^{m} rendered with random camera poses from the output shape S; see Equation 3: ... In stage-2 alignment, we still adopt L_bg to enhance the model's foreground awareness. Comparing Figures 4 (a) and (b), we can see that the stage-2 alignment is able to find a rough shape with L_bg in around five iterations, yet fails to produce a reasonable output without L_bg, since having the same color prediction on both foreground and background hinders the object awareness of the model. Thanks to the joint text-image embedding of CLIP, the gap between text feature f_T and shape feature f_S has already been largely narrowed by M. Therefore, the stage-2 alignment only needs to fine-tune M for 20 iterations using the input text, taking only around 85 seconds on a single GeForce RTX 3090 Ti, compared with 72 minutes taken by Dream Fields (Jain et al., 2022) at test time.
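
To make the quoted stage-2 alignment concrete, the following is a minimal PyTorch-style sketch of the described loop: a frozen CLIP text encoder provides f_T, and only the mapper M is updated for around 20 iterations so that renderings of the output shape from random camera poses agree with the text in CLIP embedding space, with an extra background term standing in for L_bg. The helpers mapper, decoder, render, and sample_random_pose, the white-background form of the loss, and all hyperparameters are assumptions for illustration, not the paper's implementation; only the openai/CLIP calls follow a real API, and CLIP's input normalization is omitted for brevity.

    import torch
    import torch.nn.functional as F
    import clip  # openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)
    clip_model.eval()

    def stage2_alignment(mapper, decoder, render, sample_random_pose, text,
                         m=4, iters=20, lr=1e-4, lambda_bg=1.0):
        """Fine-tune the mapper M so renderings of the generated shape match
        the input text in CLIP space (hypothetical helper signatures)."""
        for p in clip_model.parameters():        # CLIP stays frozen
            p.requires_grad_(False)

        tokens = clip.tokenize([text]).to(device)
        f_T = F.normalize(clip_model.encode_text(tokens), dim=-1)  # text feature f_T

        opt = torch.optim.Adam(mapper.parameters(), lr=lr)         # only M is updated
        for _ in range(iters):                                     # ~20 iterations per the paper
            f_S = mapper(f_T)                                      # map f_T toward the shape space
            shape = decoder(f_S)                                   # frozen decoder -> 3D shape S

            loss_clip, loss_bg = 0.0, 0.0
            for _ in range(m):                                     # m renderings, random camera poses
                pose = sample_random_pose()
                rgb, alpha = render(shape, pose)                   # image R_i and foreground mask
                rgb_224 = F.interpolate(rgb, size=(224, 224),
                                        mode="bilinear", align_corners=False)
                f_R = F.normalize(clip_model.encode_image(rgb_224), dim=-1)
                loss_clip = loss_clip + (1.0 - (f_R * f_T).sum(dim=-1).mean())
                # Background term standing in for L_bg: push background pixels toward a
                # constant (white) color so the model stays foreground-aware. The exact
                # form of L_bg in the paper may differ; this is an assumption.
                loss_bg = loss_bg + (((1.0 - alpha) * (rgb - 1.0)) ** 2).mean()

            loss = (loss_clip + lambda_bg * loss_bg) / m
            opt.zero_grad()
            loss.backward()
            opt.step()
        return shape

This sketch reflects why the reported test-time cost is small: CLIP and the shape decoder are frozen, so only the lightweight mapper M is optimized for a handful of iterations per text prompt.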