Visual Object Networks: Image Generation with Disentangled 3D Representations
Authors: Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, Bill Freeman
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that VON produces more realistic image samples than recent 2D deep generative models. We also demonstrate many 3D applications enabled by our disentangled representation, including rotating an object, adjusting object shape and texture, interpolating between two objects in texture and shape space independently, and transferring the appearance of a real image to new objects and viewpoints. |
| Researcher Affiliation | Collaboration | Jun-Yan Zhu (MIT CSAIL); Zhoutong Zhang (MIT CSAIL); Chengkai Zhang (MIT CSAIL); Jiajun Wu (MIT CSAIL); Antonio Torralba (MIT CSAIL); Joshua B. Tenenbaum (MIT CSAIL); William T. Freeman (MIT CSAIL, Google) |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (e.g., Figure 2) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please find our implementation on GitHub. |
| Open Datasets | Yes | We use ShapeNet [Chang et al., 2015] for learning to generate 3D shapes. ShapeNet is a large shape repository of 55 object categories. Here we use the chair and car categories, which have 6,777 and 3,513 CAD models respectively. For 2D datasets, we use the recently released Pix3D dataset to obtain 1,515 RGB images of chairs along with their silhouettes [Sun et al., 2018a], plus an additional 448 clean background images crawled from Google image search. We also crawled 2,605 images of cars. |
| Dataset Splits | No | The paper states it uses ShapeNet and Pix3D for training but does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper mentions implementing a 'custom CUDA kernel' which implies the use of a GPU, but it does not specify any particular GPU model, CPU, memory, or other hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions various frameworks and algorithms used (e.g., 'Adam solver', 'ResNet encoder-decoder', 'WGAN-GP', 'CycleGAN') but does not specify their version numbers or any other software dependencies with specific versions (such as Python, PyTorch/TensorFlow, or CUDA versions). (A hedged WGAN-GP gradient-penalty sketch follows the table.) |
| Experiment Setup | Yes | We train our models on 128 × 128 × 128 shapes (voxels or distance functions) and 128 × 128 × 3 images. During training, we first train the shape generator G_shape on 3D shape collections and then train the texture generator G_texture given ground-truth 3D shape data and image data. Finally, we fine-tune both modules together. We sample the shape code z_shape and texture code z_texture from the standard Gaussian distribution N(0, I), with code lengths \|z_shape\| = 200 and \|z_texture\| = 8. The entire training usually takes two to three days. For hyperparameters, we set λ_KL = 0.05, λ_GP = 10, λ_cyc-image = 10, λ_cyc-2.5D = 25, λ_cyc-texture = 1, and λ_shape = 0.05. We use the Adam solver [Kingma and Ba, 2015] with a learning rate of 0.0002 for shape generation and 0.0001 for texture generation. (A minimal configuration sketch based on these values follows the table.) |
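
The hyperparameters quoted in the "Experiment Setup" row can be gathered into a small configuration sketch. The snippet below assumes PyTorch; the module stubs `G_shape`, `G_texture`, and the helper `sample_codes` are illustrative placeholders, not the authors' code. Only the latent-code lengths, loss weights, and Adam learning rates come from the paper.

```python
# Minimal sketch of the training configuration reported in the paper, assuming PyTorch.
# G_shape / G_texture below are placeholder modules, not the paper's architectures.
import torch
import torch.nn as nn

# Latent code lengths reported in the paper.
Z_SHAPE_DIM = 200
Z_TEXTURE_DIM = 8

# Loss weights reported in the paper.
LAMBDA = dict(kl=0.05, gp=10.0, cyc_image=10.0, cyc_25d=25.0, cyc_texture=1.0, shape=0.05)

def sample_codes(batch_size, device="cpu"):
    """Sample shape and texture codes from a standard Gaussian N(0, I)."""
    z_shape = torch.randn(batch_size, Z_SHAPE_DIM, device=device)
    z_texture = torch.randn(batch_size, Z_TEXTURE_DIM, device=device)
    return z_shape, z_texture

# Placeholder generators: the real G_shape outputs 128^3 voxel grids and the real
# G_texture outputs 128x128x3 images; these stubs exist only to wire up the optimizers.
G_shape = nn.Sequential(nn.Linear(Z_SHAPE_DIM, 128 ** 3))
G_texture = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))

# Adam optimizers with the learning rates quoted above.
opt_shape = torch.optim.Adam(G_shape.parameters(), lr=2e-4)
opt_texture = torch.optim.Adam(G_texture.parameters(), lr=1e-4)
```

Per the paper's schedule, one would first optimize G_shape on 3D shape collections, then G_texture given ground-truth shapes and images, and finally fine-tune both together; the sketch above only captures the static configuration, not that loop.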
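
The "Software Dependencies" row names WGAN-GP among the techniques used, and the setup row gives λ_GP = 10. Below is a minimal, hedged sketch of the standard WGAN-GP gradient penalty (Gulrajani et al.), again assuming PyTorch; the discriminator `D` and the 4-D image tensors are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the WGAN-GP gradient penalty with lambda_GP = 10, assuming PyTorch.
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """Penalize the critic's gradient norm at random interpolates between real and fake samples."""
    batch_size = real.size(0)
    # One interpolation coefficient per sample, broadcast over image dimensions.
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = D(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    grads = grads.view(batch_size, -1)
    # (||grad||_2 - 1)^2, averaged over the batch and scaled by lambda_GP.
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```

This is the generic formulation of the penalty; how VON applies it to its 2.5D sketch and image discriminators is not specified at this level of detail in the quoted text.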