PlaceNet: Neural Spatial Representation Learning with Multimodal Attention
Authors: Chung-Yeon Lee, Youngjae Yoo, Byoung-Tak Zhang
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train the proposed method on a large-scale multimodal scene dataset consisting of 120 million indoor scenes, and demonstrate that PlaceNet successfully generalizes to various environments with lower training loss, higher image quality and structural similarity of predicted scenes, compared to a competitive baseline model. (A hedged sketch of the structural-similarity evaluation follows the table.) |
| Researcher Affiliation | Collaboration | Chung-Yeon Lee¹·², Youngjae Yoo¹, Byoung-Tak Zhang¹·³ (¹Seoul National University, ²Surromind, ³AIIS) |
| Pseudocode | No | The paper describes its methods using diagrams and mathematical formulations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Available at: https://github.com/jamixlee/placenet |
| Open Datasets | No | To train PlaceNet, we built a large-scale dataset consisting of about 120 million complex indoor scene images. We collected the scene and pose information an agent encounters while traversing the realistic 3D virtual houses, previously introduced by [Song et al., 2017], and the House3D simulator [Wu et al., 2018]. (No direct access information for their specific 120M-image dataset is provided.) |
| Dataset Splits | Yes | Our dataset consists of 115,781 training samples and 10,081 evaluation samples extracted from the same 25,071 houses as the training dataset, but from different viewpoints. We further generated 6,501 samples for evaluation from 929 houses which were not used to make the training dataset. The two evaluation datasets were labeled as seen and unseen in the evaluation phase. (A sketch of this split scheme follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify software dependencies with version numbers (e.g., specific deep learning frameworks like PyTorch or TensorFlow, or their versions). |
| Experiment Setup | Yes | The proposed models are implemented with a scene encoder's dimension of 256 channels, and the convolutional LSTM's hidden state of 128 channels. The generation network has 12 layers, and weights are not shared between generation steps for better performance. The number of observations given to the models is determined randomly for the training (maximum of 20), and is fixed to 5 for the evaluation phase. For training, we use the Adam optimizer [Kingma and Ba, 2014] with initial learning rate 5e-4, which linearly decays by a factor of 10 over 1.6M optimizer steps according to the scheduler. Additional hyper-parameters used for the training are shown in supplementary material B. (A hedged PyTorch sketch of this setup appears below.) |
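
The "structural similarity of predicted scenes" quoted in the Research Type row most plausibly refers to the standard SSIM metric. The paper does not release its evaluation script, so the following is only a minimal sketch: the `scene_ssim` and `mean_ssim` helpers and the `predicted`/`target` arrays are hypothetical names, and the metric is computed with scikit-image's stock implementation.

```python
# Hedged sketch: score predicted scenes against ground truth with SSIM.
# `predicted` and `target` are hypothetical (H, W, 3) uint8 arrays; this
# only illustrates the metric, not the paper's actual evaluation code.
import numpy as np
from skimage.metrics import structural_similarity


def scene_ssim(predicted: np.ndarray, target: np.ndarray) -> float:
    """SSIM for one predicted RGB scene (channels on the last axis)."""
    return structural_similarity(predicted, target, channel_axis=-1)


def mean_ssim(pred_batch, target_batch) -> float:
    """Average SSIM over an evaluation batch of scene images."""
    scores = [scene_ssim(p, t) for p, t in zip(pred_batch, target_batch)]
    return float(np.mean(scores))
```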
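
The Dataset Splits row implies two evaluation conditions: *seen* (the same 25,071 training houses, viewed from new viewpoints) and *unseen* (929 held-out houses). Since the 120M-image dataset itself is not released, the sketch below only illustrates how such house-level splits could be organized; `all_house_ids` and both helper functions are hypothetical.

```python
# Hedged sketch: partition houses into training vs. held-out sets and
# label evaluation samples as seen/unseen, mirroring the quoted counts.
import random


def make_splits(all_house_ids, n_train_houses=25_071, n_unseen_houses=929, seed=0):
    """Split house IDs into training houses and held-out (unseen) houses."""
    rng = random.Random(seed)
    houses = list(all_house_ids)
    rng.shuffle(houses)
    train_houses = set(houses[:n_train_houses])
    unseen_houses = set(houses[n_train_houses:n_train_houses + n_unseen_houses])
    return train_houses, unseen_houses


def label_eval_sample(house_id, train_houses):
    """Seen: house appeared in training (evaluated from new viewpoints);
    unseen: house was held out of training entirely."""
    return "seen" if house_id in train_houses else "unseen"
```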
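
The Experiment Setup row maps directly onto a standard PyTorch optimizer and scheduler. Below is a minimal sketch under stated assumptions: the model is a placeholder (only the hyper-parameters are quoted, not the architecture code), and the linear decay "by a factor of 10 over 1.6M optimizer steps" is interpreted as interpolating the learning rate from 5e-4 down to 5e-5.

```python
# Hedged sketch of the quoted training setup; the Linear layer is a
# placeholder for the actual PlaceNet model, which is not reproduced here.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

ENCODER_CHANNELS = 256       # scene encoder output channels (quoted)
CONVLSTM_HIDDEN = 128        # convolutional LSTM hidden channels (quoted)
GENERATION_LAYERS = 12       # generation steps, weights not shared (quoted)
MAX_TRAIN_OBSERVATIONS = 20  # random observation count during training (quoted)
EVAL_OBSERVATIONS = 5        # fixed observation count at evaluation (quoted)

TOTAL_STEPS = 1_600_000
LR_INIT, LR_FINAL = 5e-4, 5e-5  # "decays by a factor of 10" (interpretation)

model = torch.nn.Linear(8, 8)   # placeholder for the actual PlaceNet model
optimizer = Adam(model.parameters(), lr=LR_INIT)


def linear_decay(step: int) -> float:
    """LR multiplier interpolated linearly from 1.0 down to 0.1."""
    frac = min(step / TOTAL_STEPS, 1.0)
    return 1.0 - frac * (1.0 - LR_FINAL / LR_INIT)


# Call scheduler.step() once per optimizer step during training.
scheduler = LambdaLR(optimizer, lr_lambda=linear_decay)
```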