Neural Multisensory Scene Inference

Authors: Jae Hyun Lim, Pedro O. O. Pinheiro, Negar Rostamzadeh, Chris Pal, Sungjin Ahn

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE). Section 4 (Experiment) states that the proposed model is evaluated with respect to the following criteria: (i) cross-modal density estimation in terms of log-likelihood, (ii) ability to perform cross-modal sample generation, (iii) quality of the learned representation by applying it to a downstream classification task, (iv) robustness to the missing-modality problem, and (v) space and computational cost.
Researcher Affiliation | Collaboration | Jae Hyun Lim (1, 2, 3), Pedro O. Pinheiro (1), Negar Rostamzadeh (1), Christopher Pal (1, 2, 3, 4), Sungjin Ahn (5); affiliations: 1 Element AI, 2 Mila, 3 Université de Montréal, 4 Polytechnique Montréal, 5 Rutgers University
Pseudocode | No | The paper describes its algorithms and model architectures in text and diagrams (Fig. 2, Fig. S2) but does not provide formal pseudocode blocks or algorithm listings (a hedged sketch of the kind of multimodal latent fusion the paper describes appears below the table).
Open Source Code | Yes | Code is available at: https://github.com/lim0606/pytorch-generative-multisensory-network
Open Datasets | No | To evaluate our model we have developed an environment, the Multisensory Embodied 3D-Scene Environment (MESE). MESE integrates MuJoCo (Todorov et al., 2012), MuJoCo HAPTIX (Kumar & Todorov, 2015), and the OpenAI Gym (Brockman et al., 2016) for 3D scene understanding through multisensory interactions. The paper describes a custom environment (MESE) built for the experiments by integrating existing simulation tools, but it does not provide concrete access (a link, DOI, or explicit statement of public availability) to the specific dataset generated or used for training within this environment (see the environment-interface sketch below the table).
Dataset Splits | No | For the validation dataset, the models are evaluated with the same limited combinations as done in training (val_missing), as well as with all combinations (val_full). The paper mentions a "validation dataset" and its use in experiments (e.g., in the Missing-modality Problem section), but it does not specify concrete split percentages, sample counts, or a detailed methodology for creating these splits from the overall data (a sketch of how such modality-combination splits might be enumerated appears below the table).
Hardware Specification | No | The system is implemented using PyTorch (Paszke et al., 2017) and run on a single GPU. It uses cuDNN (Chetlur et al., 2014) for fast convolution operations. The training takes about 3 days with a single GPU. The paper mentions using "a single GPU" but does not specify the model (e.g., NVIDIA V100, RTX 3090), the CPU, or other specific hardware components.
Software Dependencies | No | The system is implemented using PyTorch (Paszke et al., 2017) and run on a single GPU. It uses cuDNN (Chetlur et al., 2014) for fast convolution operations. MESE integrates MuJoCo (Todorov et al., 2012), MuJoCo HAPTIX (Kumar & Todorov, 2015), and the OpenAI Gym (Brockman et al., 2016). While the software components and their authors are cited, specific version numbers (e.g., PyTorch 1.9, CUDA 11.1) are not provided in the text for reproducibility (a version-recording sketch appears below the table).
Experiment Setup | Yes | In our experiments, the visual input is a 64×64 RGB image and the haptic input is 132-dimensional, consisting of the hand pose and touch senses. During training, we use both modalities for each sampled scene and use 0 to 15 randomly sampled context query-sense pairs for each modality. For more details on the experimental environments, implementations, and settings, refer to Appendix A. Appendix A states: "All experiments were conducted with a learning rate of 5e-5, Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, and a batch size of 16. We clip gradients by norm 1.0. The latent variable dimension D was set to 256. The ConvDRAW architecture consists of 5 LSTM layers, each with 256 hidden units. The training takes about 3 days with a single GPU." (A training-configuration sketch based on these quoted settings appears below the table.)
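
Since the Pseudocode row notes that the model is described only in text and figures, the following is a minimal, hedged sketch of one standard way to express the kind of multimodal fusion the paper describes: combining modality-specific Gaussian encodings into a single scene-level Gaussian via a precision-weighted product of experts. All function and variable names here are illustrative assumptions, not the authors' implementation (which lives in the linked repository).

```python
import torch

def product_of_gaussian_experts(mus, logvars):
    """Fuse per-modality Gaussian experts N(mu_m, var_m) into one Gaussian
    by precision-weighted combination (standard product-of-experts).

    mus, logvars: tensors of shape (num_experts, batch, latent_dim).
    Returns the fused mean and log-variance, each of shape (batch, latent_dim).
    """
    precisions = torch.exp(-logvars)             # 1 / var_m for each expert
    fused_var = 1.0 / precisions.sum(dim=0)      # combined variance
    fused_mu = fused_var * (mus * precisions).sum(dim=0)
    return fused_mu, torch.log(fused_var)

# Illustrative use: fuse a vision expert and a haptic expert for a
# 256-dimensional latent (the latent size reported in Appendix A).
vision_mu, vision_logvar = torch.randn(2, 16, 256)   # hypothetical encoder outputs
haptic_mu, haptic_logvar = torch.randn(2, 16, 256)
mu, logvar = product_of_gaussian_experts(
    torch.stack([vision_mu, haptic_mu]),
    torch.stack([vision_logvar, haptic_logvar]),
)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
```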
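The Open Datasets row notes that MESE is built on MuJoCo, MuJoCo HAPTIX, and OpenAI Gym but is not released as a downloadable dataset. Below is a hedged sketch of what a Gym-style interface exposing the two observation types reported in the paper (64×64 RGB images and 132-dimensional haptic vectors) could look like; the class name, action space, reward, and dynamics are placeholders, not the actual MESE implementation.

```python
import numpy as np
import gym
from gym import spaces

class ToyMultisensoryEnv(gym.Env):
    """Hypothetical stand-in for an MESE-like environment: each observation
    pairs a 64x64 RGB render with a 132-d haptic (hand pose + touch) vector.
    Uses the classic Gym reset/step API that was current around 2019."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Dict({
            "image": spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8),
            "haptic": spaces.Box(low=-np.inf, high=np.inf, shape=(132,), dtype=np.float32),
        })
        # Placeholder action space (e.g., a camera/hand query); the real query
        # structure is defined by MESE's MuJoCo HAPTIX scenes.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()  # random observation as a stand-in

    def step(self, action):
        obs = self.observation_space.sample()
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info
```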
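The Dataset Splits row notes that validation uses a limited set of modality combinations (val_missing) versus all combinations (val_full), without reporting split sizes. The snippet below is a hedged illustration of how such combination sets could be enumerated; the modality names and the choice of which combinations count as "limited" are assumptions made purely for illustration, not the paper's protocol.

```python
from itertools import combinations

# Hypothetical modality names used only for illustration.
modalities = ["vision", "haptic_left", "haptic_right"]

# val_full: every non-empty combination of modalities offered as context.
val_full = [set(c)
            for r in range(1, len(modalities) + 1)
            for c in combinations(modalities, r)]

# val_missing: only the limited combinations assumed to be seen in training
# (single modalities here -- an illustrative choice, not the paper's).
val_missing = [{m} for m in modalities]

print(len(val_full), len(val_missing))  # 7 3
```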
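Because the Software Dependencies row flags the absence of version numbers, the following generic snippet shows one way to record the versions relevant to reproducing a PyTorch/cuDNN setup like the one described. It is a logging sketch of my own, not something taken from the authors' repository.

```python
import torch

def report_environment():
    """Print framework and GPU-stack versions relevant to reproducing a
    PyTorch + cuDNN training run such as the one described in the paper."""
    print("PyTorch:", torch.__version__)
    print("CUDA (build):", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    print("GPU available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU device:", torch.cuda.get_device_name(0))

report_environment()
```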
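Finally, the hyperparameters quoted in the Experiment Setup row translate directly into a training-loop skeleton like the one below. The optimizer settings, batch size, gradient clipping, and latent size come from the Appendix A quote above; the model and loss are placeholders, since the actual GMN/ConvDRAW architecture and ELBO objective are defined in the authors' repository.

```python
import torch
from torch import nn

# Placeholder model standing in for the paper's actual architecture.
model = nn.Sequential(nn.Linear(132, 256), nn.ReLU(), nn.Linear(256, 256))

# Settings quoted from Appendix A: Adam, lr = 5e-5, betas = (0.9, 0.999),
# batch size 16, gradient clipping by norm 1.0, latent dimension 256.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
batch_size, latent_dim = 16, 256

for step in range(3):  # a few illustrative steps; full training reportedly took ~3 days on one GPU
    haptic = torch.randn(batch_size, 132)           # stand-in for a 132-d haptic batch
    target = torch.randn(batch_size, latent_dim)    # stand-in regression target
    loss = nn.functional.mse_loss(model(haptic), target)  # placeholder loss (the paper optimizes an ELBO)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```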