TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

Authors: Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum

ICLR 2024

Reproducibility assessment: each variable below is listed with its result and the supporting LLM response.

Research Type: Experimental
LLM response: "Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero123 with more plausible, controllable and multiview-consistent NVS results."

Researcher Affiliation: Collaboration
LLM response: Tsinghua University; Hong Kong University of Science and Technology; International Digital Economy Academy (IDEA)

Pseudocode: No
LLM response: The paper does not contain any structured pseudocode or algorithm blocks.

Open Source Code: No
LLM response: The paper does not provide an unambiguous statement or a direct link to the source code for the TOSS method itself. It mentions using a third-party framework (threestudio) that has a public repository, but not the authors' own implementation.

Open Datasets: Yes
LLM response: "We employ an automated captioning procedure similar to that of Cap3D (Luo et al., 2023) but alter its final-stage caption fusion strategy slightly to caption Objaverse (Deitke et al., 2023)... We evaluate on GSO (Downs et al., 2022) and RTMV (Tremblay et al., 2022) for NVS quality in Tab. 1."

Dataset Splits: Yes
LLM response: "For each 3D instance, we randomly sample 12 views for training. [...] Furthermore, we evaluate the 3D consistency scores following (Watson et al., 2022). Specifically, we randomly sample 100 camera poses in the whole sphere, with 80 used for training and 20 for testing."

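The quoted protocol (100 camera poses, 80 for training and 20 for testing) can be sketched as follows; the uniform-sphere sampler and the fixed seeds are illustrative assumptions, not the authors' released code:

    import numpy as np

    def sample_sphere_poses(n=100, seed=0):
        # Draw n camera positions uniformly on the unit sphere by
        # normalizing isotropic Gaussian samples.
        rng = np.random.default_rng(seed)
        v = rng.normal(size=(n, 3))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    poses = sample_sphere_poses(100)
    perm = np.random.default_rng(1).permutation(len(poses))
    train_poses = poses[perm[:80]]  # views used to fit the neural field
    test_poses = poses[perm[80:]]   # held-out views for scoring

In the scoring procedure of Watson et al. (2022), a neural field is fit to the training views and image quality is then measured on the held-out views.
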
Hardware Specification: Yes
LLM response: "We train 120K steps with a total batch size of 2048 on 8 A100 GPUs which takes about 7 days."

Software Dependencies: Yes
LLM response: "We initialize TOSS with pre-trained Stable Diffusion v1.5 with both CLIP encoder and VAE encoder frozen."

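A minimal sketch of this initialization with the Hugging Face diffusers library; the library choice and model identifier are assumptions, since the paper does not name its implementation stack:

    import torch
    from diffusers import StableDiffusionPipeline

    # Load pre-trained Stable Diffusion v1.5 weights (the model id is an assumption).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    # Freeze the CLIP text encoder and the VAE, as stated in the paper;
    # the UNet (plus any newly added modules) remains trainable.
    pipe.text_encoder.requires_grad_(False)
    pipe.vae.requires_grad_(False)
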
Experiment Setup: Yes
LLM response: "We train 120K steps with a total batch size of 2048 on 8 A100 GPUs which takes about 7 days. For training with expert denoisers, we first train 60K steps for all noise levels, then initialize two expert models from this base model and resume training for 12K and 48K steps respectively for high (timesteps 1000-800) and low (timesteps 800-0) noise levels. [...] During model training we employ an AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 10^-3 for the camera pose encoder and 10^-4 for other modules. In all the experiments, we train our model with the 16-bit floating point (fp16) format for efficiency. To enable classifier-free guidance, we randomly mask 50% of the samples for text in each batch and 10% for condition images."

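A hedged sketch of the per-module learning rates and the classifier-free-guidance condition dropout described above; the module names, null-embedding convention, and tensor shapes are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Stand-ins for the camera pose encoder and the denoising UNet
    # (the real modules and their shapes are not specified here).
    model = nn.ModuleDict({
        "pose_encoder": nn.Linear(16, 768),
        "unet": nn.Linear(768, 768),
    })

    # AdamW with two parameter groups, matching the quoted rates:
    # 1e-3 for the camera pose encoder, 1e-4 for other modules.
    optimizer = torch.optim.AdamW([
        {"params": model["pose_encoder"].parameters(), "lr": 1e-3},
        {"params": model["unet"].parameters(), "lr": 1e-4},
    ])

    def cfg_dropout(text_emb, img_cond, null_text, null_img):
        """Randomly replace conditions with null embeddings so the model
        supports classifier-free guidance at sampling time.
        Shapes (assumed): text_emb (B, L, D), img_cond (B, C, H, W),
        null_text (1, L, D), null_img (1, C, H, W)."""
        b = text_emb.shape[0]
        drop_text = torch.rand(b, device=text_emb.device) < 0.5  # 50% of text
        drop_img = torch.rand(b, device=img_cond.device) < 0.1   # 10% of images
        text_emb = torch.where(drop_text.view(b, 1, 1), null_text, text_emb)
        img_cond = torch.where(drop_img.view(b, 1, 1, 1), null_img, img_cond)
        return text_emb, img_cond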