AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
Authors: Mingfei Chen, Eli Shlizerman
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that AV-Cloud surpasses current state-of-the-art accuracy on audio reconstruction, perceptual quality, and acoustic effects on two real-world datasets. AV-Cloud also outperforms previous methods when tested on scenes in the wild... |
| Researcher Affiliation | Academia | Department of Electrical & Computer Engineering, University of Washington, Seattle, USA; Department of Applied Mathematics, University of Washington, Seattle, USA; Corresponding author: shlizee@uw.edu |
| Pseudocode | No | The paper includes architectural diagrams and flowcharts (e.g., Figure 1, Figure 2, Figure 3) but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Code can be obtained for limited use upon request from the corresponding author. We also aim to release the code as a public repository on GitHub upon obtaining the involved approvals. |
| Open Datasets | Yes | This work focuses on audio-visual synthesis for real-world scenes. Real-world data present challenges such as background noise and discrepancies in sound propagation compared to simulated environments. We have conducted experiments on the following two real-world datasets. 1) RWAVS Dataset [29]... 2) Replay-NVAS Dataset [46]... |
| Dataset Splits | Yes | RWAVS Dataset [29]: As in [29], we split 80% of the data as training samples and use the rest for validation, with all audio resampled to a frequency of 22,050 Hz. ... Replay-NVAS Dataset [46]: 28/6/7 multi-view videos are used for training, validation, and testing, respectively. |
| Hardware Specification | Yes | Speed tests are conducted on a GeForce RTX 2080 Ti, with results averaged over 1000 samples. ... Deployed in our navigation platform, AV-Cloud can achieve over 25 FPS on audio rendering (each sample with 257 spectrogram frequencies and 182 time frames, i.e., 0.5 s of audio sampled at 44,100 Hz for our experiments) on an Apple M2 chip. |
| Software Dependencies | No | The paper mentions using COLMAP [47] and developing a WebGL-based platform using JavaScript and HTML based on [52], but it does not specify version numbers for these software components or other libraries/packages. |
| Experiment Setup | Yes | For all experiments, we utilize COLMAP [47] to derive initial SfM points and camera pose estimation from the provided videos for each scene. We set the K-Means cluster number N to 256, and initialize each of these anchors with the RGB values of the nearest K = 50 points in Section 3.2. We utilize the Short-Time Fourier Transform (STFT) to convert waveform audio into the time-frequency domain, setting the FFT size, window length, and hop length to 512, 512, and 128, respectively, and applying a Hanning window. For the Visual-to-Audio Splatting Transformer (Section 3.3), we deploy a 3-layer attention module and set the frequency band number F and embedding dimension C to 257 and 128, respectively. Time Filters (Section 3.4) are developed using distinct 2-layer 1D convolution modules that generate filter kernels and biases, conditioned on the integrated Relative Vector. The convolution for time distribution adjustments has a kernel size of 3 on the frequency domain. ... The Conv2d layers in the render head's residual unit (Section 3.4) comprise three stacked Conv2d modules with kernel sizes of 7, 3, and 3, and a hidden channel size of 16. We employ the Adam optimizer for optimization, using an exponentially decaying learning rate starting from 0.01 and spanning 100 epochs, with a batch size of 6. |
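
The RWAVS split described in the Dataset Splits row (80% training / 20% validation, audio resampled to 22,050 Hz) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline; the sample-ID handling, random seed, and file-loading details are assumptions.

```python
import random
import torchaudio

TARGET_SR = 22_050  # resampling target quoted for the RWAVS dataset

def split_samples(sample_ids, train_ratio=0.8, seed=0):
    """Shuffle sample IDs and return (train, validation) lists with an 80/20 split."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # the seed is an illustrative choice, not from the paper
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]

def load_resampled(path):
    """Load an audio file and resample it to 22,050 Hz if needed."""
    waveform, sr = torchaudio.load(path)
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    return waveform
```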
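
The STFT settings quoted in the Experiment Setup row (FFT size 512, window length 512, hop length 128, Hanning window) map directly onto a PyTorch call. The sketch below is an assumption of how the transform is applied, not the authors' code; it yields 257 frequency bins, matching the frequency band number F = 257.

```python
import torch

N_FFT, WIN_LENGTH, HOP_LENGTH = 512, 512, 128  # values quoted in the Experiment Setup row

def waveform_to_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a mono waveform [num_samples] to a complex spectrogram [257, num_frames]."""
    window = torch.hann_window(WIN_LENGTH)
    return torch.stft(
        waveform,
        n_fft=N_FFT,
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        window=window,
        return_complex=True,
    )

# Example: 0.5 s of audio at 44,100 Hz; the exact frame count depends on the padding mode.
spec = waveform_to_spectrogram(torch.randn(int(0.5 * 44_100)))
print(spec.shape)  # 257 frequency bins
```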
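
The anchor initialization in the Experiment Setup row (N = 256 K-Means clusters over the SfM point cloud, each anchor initialized with the RGB values of its K = 50 nearest points) could look roughly like the following. Averaging the nearest-point colors is an assumption about how the initialization is realized; the function and argument names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_anchors(points_xyz: np.ndarray, points_rgb: np.ndarray,
                 n_anchors: int = 256, k_nearest: int = 50):
    """Cluster SfM points into anchors and color each anchor from its nearest points."""
    kmeans = KMeans(n_clusters=n_anchors, n_init=10).fit(points_xyz)
    anchors_xyz = kmeans.cluster_centers_
    anchors_rgb = np.zeros((n_anchors, 3))
    for i, center in enumerate(anchors_xyz):
        dists = np.linalg.norm(points_xyz - center, axis=1)
        nearest = np.argsort(dists)[:k_nearest]
        anchors_rgb[i] = points_rgb[nearest].mean(axis=0)  # assumed aggregation of RGB values
    return anchors_xyz, anchors_rgb
```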
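
The optimization settings in the Experiment Setup row (Adam, exponentially decaying learning rate starting from 0.01, 100 epochs, batch size 6) correspond to standard PyTorch utilities as sketched below. The `model` placeholder and the per-epoch decay factor `gamma` are assumptions; the paper does not state the decay factor.

```python
import torch

model = torch.nn.Linear(128, 257)  # placeholder for the AV-Cloud model (illustrative only)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)                   # initial learning rate 0.01
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)   # decay factor assumed

NUM_EPOCHS, BATCH_SIZE = 100, 6
for epoch in range(NUM_EPOCHS):
    # ... one pass over the training loader with batch_size=BATCH_SIZE ...
    scheduler.step()  # exponential learning-rate decay applied once per epoch
```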