Deep Neural Room Acoustics Primitive

Authors: Yuhang He, Anoop Cherian, Gordon Wichern, Andrew Markham

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We present experiments on both synthetic and real-world datasets, demonstrating superior quality in RIR estimation against closely related methods. To empirically validate the superiority of DeepNeRAP, we conduct experiments on both synthetic and real-world datasets." |
| Researcher Affiliation | Collaboration | "Yuhang He¹, Anoop Cherian², Gordon Wichern², Andrew Markham¹ ... ¹Department of Computer Science, University of Oxford, Oxford, UK; ²Mitsubishi Electric Research Labs, Cambridge, MA, US." |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "The detailed network architecture is illustrated in Sec. A.5 in the Appendix and the source code is given in the supplementary material." |
| Open Datasets | Yes | "For the former [synthetic dataset], we use the large-scale SoundSpaces 2.0 dataset (Chen et al., 2022; Chang et al., 2017), consisting of indoor scenes with an average room area > 100 m² and enriched with room acoustics. For the latter [real-world dataset], we use the real-world MeshRIR dataset (Koyama et al., 2021)." |
| Dataset Splits | No | The paper specifies train/test splits for both the synthetic (3000/1000) and real-world (10k/4k) datasets, but does not explicitly mention a validation split. |
| Hardware Specification | Yes | "We train DeepNeRAP on an A40 GPU with the Adam optimizer (Kingma & Ba, 2015), with an initial learning rate of 0.0005 that decays every 50 epochs with a decay rate of 0.5." |
| Software Dependencies | No | The paper mentions 'We implement DeepNeRAP in PyTorch' but does not specify a version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | "We implement DeepNeRAP in PyTorch. The detailed network architecture is illustrated in Sec. A.5 in the Appendix and the source code is given in the supplementary material. We train DeepNeRAP on an A40 GPU with the Adam optimizer (Kingma & Ba, 2015), with an initial learning rate of 0.0005 that decays every 50 epochs with a decay rate of 0.5. We train all models for 300 epochs. The learnable room acoustic representation M consists of 500 × 500 × 2 entries, i.e., the grid is 500 × 500 and each entry is associated with a learnable feature of size 2. The scale number is L = 256 and the scale resolution is r = 2.1. The aggregated room acoustic feature representation for one position is of size 512. The primitive encoder E network consists of 6 multi-layer perceptron (MLP) layers, each of which consists of a fully-connected layer, a batch normalization layer, and a ReLU activation layer. Each MLP layer's hidden unit number is 512." |
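The Experiment Setup row pins down enough architectural detail to sketch in code. Below is a minimal PyTorch sketch of the learnable representation M (a 500 × 500 grid with a 2-dimensional learnable feature per entry) and the primitive encoder E (six MLP layers of Linear → BatchNorm → ReLU with 512 hidden units). The class names, the grid-lookup interface, and the assumption that E consumes the 512-dimensional aggregated position feature are illustrative guesses, not the authors' released code.

```python
import torch
import torch.nn as nn

class LearnableAcousticGrid(nn.Module):
    """Hypothetical stand-in for the learnable representation M:
    a 500 x 500 grid where each entry holds a learnable feature of size 2."""
    def __init__(self, grid_size=500, feat_dim=2):
        super().__init__()
        self.grid = nn.Parameter(0.01 * torch.randn(grid_size, grid_size, feat_dim))

    def forward(self, ij):
        # ij: (B, 2) integer grid indices -> (B, feat_dim) per-position features.
        return self.grid[ij[:, 0], ij[:, 1]]

class PrimitiveEncoder(nn.Module):
    """Sketch of the primitive encoder E: 6 MLP layers, each a fully-connected
    layer + batch normalization + ReLU, with 512 hidden units per layer."""
    def __init__(self, in_dim=512, hidden_dim=512, num_layers=6):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Usage: look up toy grid positions and encode a batch of aggregated features.
grid = LearnableAcousticGrid()
encoder = PrimitiveEncoder()
positions = torch.randint(0, 500, (8, 2))  # toy grid indices
per_pos = grid(positions)                  # (8, 2) raw grid features
feats = encoder(torch.randn(8, 512))       # (8, 512) encoded features
```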
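The training recipe quoted in the Hardware Specification and Experiment Setup rows maps onto standard PyTorch components: Adam with an initial learning rate of 0.0005, halved every 50 epochs, for 300 epochs total. A minimal sketch follows, assuming a StepLR scheduler is an adequate stand-in for the paper's decay schedule; the model and the elided training step are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))  # placeholder model

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # initial learning rate 0.0005
# Decay the learning rate by a factor of 0.5 every 50 epochs, as quoted above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(300):  # "We train all models for 300 epochs."
    # ... one pass over the training data would go here (omitted) ...
    scheduler.step()
```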