Learning 3D Photography Videos via Self-supervised Diffusion on Single Images

Authors: Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on real datasets show that our method achieves competitive results with existing SOTA methods. Experiments on real datasets validate the effectiveness of our models.
Researcher Affiliation | Collaboration | ¹Peking University, ²Microsoft. {wangxiaodong21s@stu, fangyj@ss}.pku.edu.cn, {chewu, v-sheyin, t-mni, jianfw, Lindsey.Li, zhengyang, fanyang, lijuanw, zliu, nanduan}@microsoft.com
Pseudocode | No | The paper includes diagrams (Fig. 2, Fig. 3) to illustrate the architecture and process, but it does not provide any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement offering the source code for the described methodology, nor does it provide a direct link to a code repository. The supplementary material link is to a PDF, not a code repository.
Open Datasets | Yes | For a fair comparison with state-of-the-art methods, we evaluate synthesis results on two datasets, RealEstate10K (RE10K) [Zhou et al., 2018]... and Mannequin Challenge (MC) [Li et al., 2019]... To validate the effectiveness of our M-UNet, we also evaluate the image outpainting in COCO [Caesar et al., 2018].
Dataset Splits | No | The paper specifies training and test set sizes for COCO (117,266 training images and 4,952 test images) and mentions test sets for RE10K and MC, but it does not explicitly define a separate validation split or its size for any dataset.
Hardware Specification | Yes | Table 4: Inference time for 1 frame on an NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions using MiDaS [Ranftl et al., 2022] for depth estimation and that their model loaded weights from Stable Diffusion, but it does not provide specific version numbers for these or other key software components or libraries required for reproduction. (A MiDaS loading sketch appears after the table.)
Experiment Setup | Yes | We use the same intrinsic matrices, source camera poses and target camera poses for all methods. Specifically, we choose the first frame (t=1) from each test clip as the source view and consider the fifth (t=5) and tenth (t=10) frames as target views. We randomly drop 10% of text prompts during training. Tab. 4 reports inference time for a 20-step diffusion process.
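
The evaluation protocol quoted in the Experiment Setup row is simple to restate in code. Below is a minimal sketch, assuming a list of test clips with per-frame camera poses and shared intrinsics; the names `Clip`, `evaluate`, and `method` are hypothetical stand-ins and only illustrate the source/target frame indexing the paper describes, not the authors' actual evaluation code.

```python
# Minimal sketch of the evaluation protocol described above.
# Assumptions: each clip carries its frames, a shared intrinsic matrix,
# and per-frame camera poses; `method` is any view-synthesis model under test.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Clip:
    frames: List[Any]      # RGB frames of the test clip
    intrinsics: Any        # camera intrinsic matrix K (same for all methods)
    poses: List[Any]       # per-frame camera poses

SOURCE_T = 1               # first frame is the source view (t = 1)
TARGET_TS = [5, 10]        # fifth and tenth frames are the target views

def evaluate(method, test_clips: List[Clip]):
    results = []
    for clip in test_clips:
        src = clip.frames[SOURCE_T - 1]          # 1-indexed t -> 0-indexed list
        src_pose = clip.poses[SOURCE_T - 1]
        for t in TARGET_TS:
            tgt_pose = clip.poses[t - 1]
            # Every method receives identical intrinsics and poses, so metric
            # differences come from the synthesis model alone.
            pred = method(src, clip.intrinsics, src_pose, tgt_pose)
            results.append((pred, clip.frames[t - 1]))  # (prediction, ground truth)
    return results
```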
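Because the paper does not pin software versions, anyone reproducing it must choose their own MiDaS checkpoint. A common way to obtain MiDaS depth estimates is via torch.hub, as sketched below; the `DPT_Large` variant is an assumption, since the paper does not state which MiDaS model was used.

```python
# Sketch of loading MiDaS for monocular depth estimation via torch.hub.
# The "DPT_Large" variant is an assumption; the paper does not specify one.
# Weights are downloaded on first use.
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()

midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # preprocessing matched to DPT models

def estimate_depth(rgb_image):
    """Return a relative inverse-depth map for an HxWx3 uint8 RGB numpy image.

    Note: the output resolution may differ from the input; upsample the
    prediction back to the input size if pixel-aligned depth is needed.
    """
    batch = transform(rgb_image)
    with torch.no_grad():
        prediction = midas(batch)
    return prediction.squeeze().cpu().numpy()
```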