NeRF-VAE: A Geometry Aware 3D Scene Generative Model

Authors: Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Soňa Mokrá, Danilo Jimenez Rezende

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate NeRF-VAE, we first analyze its ability to reconstruct novel viewpoints given a small number of input views, and contrast that with NeRF. Second, we compare our model with a Generative Query Network-like autoregressive convolutional model (Eslami et al., 2018, GQN) and show that while NeRF-VAE achieves comparably low reconstruction errors, it has a much improved generalization ability, in particular when evaluated on camera views not seen during training. Third, we provide an ablation study of NeRF-VAE variants, with a focus on the conditioning mechanisms of the scene function described in Section 3.2. Finally, we showcase samples of NeRF-VAE. We use three datasets, each consisting of 64×64 coloured images, along with camera position and orientation for each image, and camera parameters used to extract ray position and orientation for each pixel (see the ray-generation sketch after the table).
Researcher Affiliation | Industry | DeepMind, London.
Pseudocode | No | The paper describes processes and architectures but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statements about releasing its own source code or links to a repository.
Open Datasets | Yes | GQN (Eslami et al., 2018) consists of 200k scenes, each with 10 images of rooms with a variable number of objects. Camera positions and orientations are randomly distributed along a plane within the rooms, always facing the horizon. Note that this dataset does not contain reflections or specularities, which we discuss in Section 5.4. We created a custom CLEVR dataset (Johnson et al., 2017) with 100k scenes, each with 10 views. We use the rooms_free_camera_no_object_rotations variant publicly available at https://github.com/deepmind/gqn-datasets (a hedged loading sketch follows the table).
Dataset Splits | No | The paper mentions training and testing on datasets, and evaluating on 'held-out scenes', but does not provide explicit numerical or percentage details for a validation split separate from training and testing. While 'validation' is mentioned in the context of VAE optimization, it does not refer to a distinct dataset split with specific proportions.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and a ResNet architecture, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, or other libraries with their specific versions).
Experiment Setup | Yes | The conditional scene function's architecture follows NeRF, first processing position x to produce volume density, and then additionally receiving orientation d to produce the output colours. Both position and orientation use circular encoding, whereby we augment the network input values with a Fourier basis, cf. Fig. 3. We follow Mildenhall et al. (2020, Section 5.2) in using hierarchical volume sampling in order to approximate the colour of each pixel. This means we maintain a second instance of the conditional scene function (conditioned on the same latent z), which results in an additional likelihood term in the model log-likelihood in Eq. (2). We use Adam (Kingma & Ba, 2014) and β-annealing of the KL term in Eq. (2). Full details can be found in Appendix C. (Sketches of the Fourier encoding, the hierarchical resampling step, and a generic β-annealing schedule follow the table.)
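
The Research Type row notes that camera parameters are used to extract a ray position and orientation for each pixel. A minimal sketch of that step, assuming a pinhole camera with a single focal length and a 4x4 camera-to-world pose; the paper does not state its intrinsics or axis convention, so both are assumptions here:

```python
import numpy as np

def pixel_rays(height, width, focal, cam_to_world):
    """Compute per-pixel ray origins and directions for a pinhole camera.

    cam_to_world: 4x4 matrix mapping camera coordinates to world coordinates.
    Returns (origins, directions), each of shape (height, width, 3).
    """
    # Pixel-centre grid over the image plane.
    i, j = np.meshgrid(np.arange(width, dtype=np.float32),
                       np.arange(height, dtype=np.float32), indexing="xy")
    # Directions in camera space: x right, y down, z forward.
    # (One common convention; NeRF's reference code uses y up, z backward.)
    dirs = np.stack([(i - width * 0.5) / focal,
                     (j - height * 0.5) / focal,
                     np.ones_like(i)], axis=-1)
    # Rotate into world space and normalise.
    rot = cam_to_world[:3, :3]
    directions = dirs @ rot.T
    directions /= np.linalg.norm(directions, axis=-1, keepdims=True)
    # All rays share the camera's world-space origin.
    origins = np.broadcast_to(cam_to_world[:3, 3], directions.shape)
    return origins, directions
```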
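The GQN datasets linked in the Open Datasets row are distributed as TFRecords, and the repository's data_reader.py documents the record layout. The sketch below is a rough reading of that layout (10 JPEG frames per scene plus 5-dimensional raw poses); the field names, shapes, and the local path are assumptions to verify against the repository before use:

```python
import tensorflow as tf

SEQ_LEN = 10   # views per scene in the rooms datasets
IMG_SIZE = 64  # 64x64 colour images
POSE_DIM = 5   # raw pose (x, y, z, yaw, pitch), per data_reader.py

def parse_scene(record):
    """Parse one GQN scene: SEQ_LEN JPEG frames plus raw camera poses."""
    features = tf.io.parse_single_example(record, {
        "frames": tf.io.FixedLenFeature([SEQ_LEN], tf.string),
        "cameras": tf.io.FixedLenFeature([SEQ_LEN * POSE_DIM], tf.float32),
    })
    frames = tf.map_fn(
        lambda s: tf.io.decode_jpeg(s, channels=3),
        features["frames"],
        fn_output_signature=tf.TensorSpec([IMG_SIZE, IMG_SIZE, 3], tf.uint8))
    # The repository's reader additionally expands (yaw, pitch) into
    # (cos, sin) features; here we return the raw 5-dim poses.
    cameras = tf.reshape(features["cameras"], [SEQ_LEN, POSE_DIM])
    return frames, cameras

# Hypothetical local copy of the dataset.
files = tf.io.gfile.glob(
    "gqn-datasets/rooms_free_camera_no_object_rotations/train/*")
dataset = tf.data.TFRecordDataset(files).map(parse_scene)
```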
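The "circular encoding" mentioned in the Experiment Setup row is the Fourier-feature positional encoding used by NeRF. A minimal sketch, with the number of frequency bands left as a free hyperparameter (the paper's exact settings are in its Appendix C):

```python
import numpy as np

def fourier_encode(x, num_bands):
    """Map inputs to [sin(2^k * x), cos(2^k * x)] features, k = 0..num_bands-1.

    x: array of shape (..., d), e.g. a 3D position or a unit view direction.
    Returns shape (..., 2 * num_bands * d). NeRF additionally scales the
    argument by pi and often concatenates x itself to the encoding.
    """
    freqs = 2.0 ** np.arange(num_bands)  # 1, 2, 4, ...
    angles = x[..., None] * freqs        # (..., d, num_bands)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)
```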
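Hierarchical volume sampling (Mildenhall et al., 2020, Section 5.2) reuses the coarse network's compositing weights as a distribution over depths along each ray and draws the fine samples from it by inverse-transform sampling. A condensed sketch of that resampling step, using a piecewise-constant CDF rather than the piecewise-linear one in the NeRF reference code:

```python
import numpy as np

def sample_fine(bin_midpoints, coarse_weights, num_fine, rng):
    """Draw fine-sample depths proportional to the coarse compositing weights.

    bin_midpoints: (num_coarse,) depths of the coarse samples along one ray.
    coarse_weights: (num_coarse,) weights w_i = T_i * (1 - exp(-sigma_i * delta_i)).
    """
    pdf = coarse_weights / (coarse_weights.sum() + 1e-8)
    cdf = np.cumsum(pdf)
    # Inverse-transform sampling: uniform draws mapped through the discrete CDF.
    u = rng.uniform(size=num_fine)
    idx = np.clip(np.searchsorted(cdf, u), 0, len(bin_midpoints) - 1)
    return np.sort(bin_midpoints[idx])

# Usage: depths = sample_fine(mids, weights, 128, np.random.default_rng(0))
```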
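β-annealing scales the KL term of the ELBO in Eq. (2) by a coefficient that ramps up during training, so the reconstruction likelihood dominates early optimization. The paper defers its schedule to Appendix C; a generic linear ramp looks like this:

```python
def beta_schedule(step, warmup_steps, beta_max=1.0):
    """Linearly anneal the KL coefficient from 0 to beta_max over warmup_steps."""
    return beta_max * min(step / warmup_steps, 1.0)

# ELBO-style training loss with the annealed KL term (per-batch scalars assumed):
# loss = -log_likelihood + beta_schedule(step, warmup_steps) * kl_divergence
```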