PlaceNet: Neural Spatial Representation Learning with Multimodal Attention

Authors: Chung-Yeon Lee, Youngjae Yoo, Byoung-Tak Zhang

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train the proposed method on a large-scale multimodal scene dataset consisting of 120 million indoor scenes, and demonstrate that PlaceNet successfully generalizes to various environments with lower training loss, higher image quality and structural similarity of predicted scenes, compared to a competitive baseline model.
Researcher Affiliation | Collaboration | Chung-Yeon Lee (1,2), Youngjae Yoo (1), Byoung-Tak Zhang (1,3); affiliations: 1 Seoul National University, 2 Surromind, 3 AIIS
Pseudocode | No | The paper describes its methods using diagrams and mathematical formulations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Available at: https://github.com/jamixlee/placenet
Open Datasets | No | To train PlaceNet, we built a large-scale dataset consisting of about 120 million complex indoor scene images. We collected the scene and pose information an agent encounters while traversing the realistic 3D virtual houses, previously introduced by [Song et al., 2017], and the House3D simulator [Wu et al., 2018]. (No direct access information for their specific 120M-image dataset is provided.)
Dataset Splits | Yes | Our dataset consists of 115,781 training samples and 10,081 evaluation samples; the evaluation samples were extracted from the same 25,071 houses as the training dataset, but from different viewpoints. We further generated 6,501 samples for evaluation from 929 houses which were not used to make the training dataset. The two evaluation datasets were labeled as seen and unseen in the evaluation phase. (A split-construction sketch follows this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify software dependencies with version numbers (e.g., a specific deep learning framework such as PyTorch or TensorFlow and its version).
Experiment Setup | Yes | The proposed models are implemented with a scene encoder dimension of 256 channels and a convolutional LSTM hidden state of 128 channels. The generation network has 12 layers, and weights are not shared between generation steps for better performance. The number of observations given to the models is determined randomly for training (maximum of 20) and is fixed to 5 for the evaluation phase. For training, we use the Adam optimizer [Kingma and Ba, 2014] with an initial learning rate of 5e-4, which linearly decays by a factor of 10 over 1.6M optimizer steps according to the scheduler. Additional hyper-parameters used for the training are shown in supplementary material B. (A training-configuration sketch follows this table.)
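
The seen/unseen protocol in the Dataset Splits row can be illustrated with a minimal sketch. The sample schema (dicts with a house_id key), the make_splits helper, and the eval_seen_fraction value are assumptions made for illustration; the authors' actual split-generation code is not released in this form.

```python
# Minimal sketch of a seen/unseen split: 929 houses are fully held out for the
# 'unseen' evaluation set, and a small fraction of viewpoints from the remaining
# training houses forms the 'seen' evaluation set (same houses, different views).
import random
from collections import defaultdict

def make_splits(samples, n_unseen_houses=929, eval_seen_fraction=0.08, seed=0):
    rng = random.Random(seed)
    by_house = defaultdict(list)
    for sample in samples:                      # sample: dict with a 'house_id' key
        by_house[sample["house_id"]].append(sample)

    houses = sorted(by_house)
    unseen_houses = set(rng.sample(houses, n_unseen_houses))

    train, eval_seen, eval_unseen = [], [], []
    for house, house_samples in by_house.items():
        if house in unseen_houses:
            eval_unseen.extend(house_samples)   # houses never used for training
            continue
        rng.shuffle(house_samples)
        n_eval = int(len(house_samples) * eval_seen_fraction)
        eval_seen.extend(house_samples[:n_eval])   # training houses, held-out viewpoints
        train.extend(house_samples[n_eval:])
    return train, eval_seen, eval_unseen
```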
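
The Experiment Setup row can likewise be sketched as a training configuration. The paper does not name its deep learning framework, so PyTorch is assumed here, and the constant names and the lr_lambda helper are illustrative; only the reported numbers (initial learning rate 5e-4, a linear factor-of-10 decay over 1.6M optimizer steps, a 256-channel scene encoder, a 128-channel ConvLSTM hidden state, and 12 unshared generation layers) come from the paper.

```python
# Sketch of the reported optimizer and learning-rate schedule, assuming PyTorch.
# `params` stands in for the PlaceNet weights (256-channel scene encoder,
# 128-channel ConvLSTM hidden state, 12 generation layers with unshared weights).
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

INIT_LR = 5e-4            # initial learning rate (from the paper)
FINAL_FACTOR = 0.1        # "linearly decays by a factor of 10"
TOTAL_STEPS = 1_600_000   # over 1.6M optimizer steps

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = Adam(params, lr=INIT_LR)

def lr_lambda(step: int) -> float:
    # Linear interpolation of the LR multiplier from 1.0 down to FINAL_FACTOR.
    progress = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return 1.0 - (1.0 - FINAL_FACTOR) * progress

scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

# Sanity check of the schedule: 5e-4 at step 0, 5e-5 at step 1.6M.
for step in (0, 800_000, 1_600_000):
    print(step, INIT_LR * lr_lambda(step))
```

LambdaLR is used here only because it expresses the linear factor-of-10 decay in one place; the paper states the decay behavior but not which scheduler implementation realizes it.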