Shape and Material from Sound

Authors: Zhoutong Zhang, Qiujia Li, Zhengjia Huang, Jiajun Wu, Josh Tenenbaum, Bill Freeman

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our models on a range of perception tasks: inferring object shape, material, and initial height from sound. We also collect human responses for each task and compare them with model estimates. Our results indicate that first, humans are quite successful in these tasks; second, our model not only closely matches human successes, but also makes similar errors as humans do. For these quantitative evaluations, we have mostly used synthetic data, where ground truth labels are available. We further evaluate the model on recordings to demonstrate that it also performs well on real-world audios.
Researcher Affiliation | Collaboration | Zhoutong Zhang (MIT); Qiujia Li (University of Cambridge); Zhengjia Huang (ShanghaiTech University); Jiajun Wu (MIT); Joshua B. Tenenbaum (MIT); William T. Freeman (MIT, Google Research)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using Bullet, an open-source physics engine, but does not provide concrete access to the source code for the methodology described in this paper.
Open Datasets | No | Because real audio recordings with rich labels are hard to acquire, we synthesize random audio clips using our physics-based simulation to evaluate our models. Specifically, we focus on a single scenario: shape primitives falling onto the ground. We first construct an audio dataset that includes 14 primitives (some shown in Table 2), each with 10 different specific moduli (defined as Young's modulus over density).
Dataset Splits | No | The paper mentions that the fully supervised model is trained on "200,000 audios" and uses "52 test cases" for human studies, but does not provide the specific dataset-split information (exact percentages, sample counts for train/validation/test, or detailed splitting methodology) needed to reproduce the data partitioning for its machine learning models.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions implementing the framework in Torch7.
Software Dependencies | No | The paper mentions using "Bullet" (physics engine) and "Torch7" (framework), but does not provide specific version numbers for these software components, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | We use a time step of 1/300 second to ensure simulation accuracy. We perform 80 sweeps of MCMC sampling over all the 7 latent variables; for every sweep, each variable is sampled twice. The spectrogram of the signal is computed using a Tukey window of length 5,000 with a 2,000-sample overlap. For each window, a 10,000-point Fourier transform is applied. We used stochastic gradient descent for training, with a learning rate of 0.001, a momentum of 0.9, and a batch size of 16. Mean Square Error (MSE) loss is used for back-propagation.
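The spectrogram settings quoted above (a Tukey window of length 5,000, a 2,000-sample overlap, and a 10,000-point Fourier transform per window) can be sketched with SciPy. This is a minimal sketch, not the authors' Torch7 implementation: the 44.1 kHz sampling rate and the Tukey shape parameter of 0.25 are assumptions not stated in the paper.

```python
import numpy as np
from scipy import signal

# Assumed sampling rate (not given in the paper).
FS = 44100

def audio_spectrogram(waveform, fs=FS):
    """Return (freqs, times, sxx) using the STFT settings quoted above."""
    freqs, times, sxx = signal.spectrogram(
        waveform,
        fs=fs,
        window=("tukey", 0.25),  # shape parameter 0.25 is an assumption
        nperseg=5000,            # window length of 5,000 samples
        noverlap=2000,           # 2,000-sample overlap (hop of 3,000)
        nfft=10000,              # 10,000-point Fourier transform
    )
    return freqs, times, sxx
```

With a 10,000-point FFT on real-valued audio, each frame yields 5,001 one-sided frequency bins, regardless of the clip length.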