SoundNet: Learning Sound Representations from Unlabeled Video

Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba

NeurIPS 2016

Each entry below gives a reproducibility variable, its result, and the LLM response (supporting evidence quoted from the paper).
Research Type: Experimental
"Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels. ... We present a deep convolutional network that learns directly on raw audio waveforms, which is trained by transferring knowledge from vision into sound. ... In our experiments, we show that the representation learned by our network obtains state-of-the-art accuracy on three standard acoustic scene classification datasets. ... We evaluate the SoundNet representation for acoustic scene classification. The aim in this task is to categorize sound clips into one of the many acoustic scene categories. We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]."

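The "transferring knowledge from vision into sound" mentioned above is the paper's teacher-student objective: the sound network is trained to match the class posteriors that pretrained vision networks produce on frames of the same unlabeled videos. A minimal sketch of such a distillation loss, assuming a PyTorch-style reimplementation (the paper itself used Torch7; tensor names and shapes are illustrative):

```python
import torch.nn.functional as F

def transfer_loss(student_logits, teacher_probs):
    """KL divergence between the vision teacher's class posterior and the
    sound student's prediction on the same unlabeled video.

    student_logits: (batch, num_classes) raw outputs of the sound network.
    teacher_probs:  (batch, num_classes) softmax outputs of the vision network.
    """
    log_student = F.log_softmax(student_logits, dim=1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")
```
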
Researcher Affiliation: Academia
Yusuf Aytar, MIT (yusuf@csail.mit.edu); Carl Vondrick, MIT (vondrick@mit.edu); Antonio Torralba, MIT (torralba@mit.edu)

Pseudocode: No
The paper does not contain any structured pseudocode or algorithm blocks.

Open Source Code: No
The paper states that "Code, data, and models will be released," but no repository link accompanies the paper.

Open Datasets: Yes
"We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]."

Dataset Splits: Yes
"We split the unlabeled video dataset into a training set and a held-out validation set. We use 2,000,000 videos for training, and the remaining 140,000 videos for validation. ... The data is prearranged into 5 folds and the accuracy results are reported as the mean of 5 leave-one-fold-out evaluations."

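The "5 leave-one-fold-out evaluations" quoted above follow the standard cross-validation protocol for the ESC datasets. A minimal sketch of that protocol; `train_eval_fn` is a hypothetical helper (not from the paper) that trains a classifier and returns test accuracy:

```python
import numpy as np

def leave_one_fold_out_accuracy(features, labels, folds, train_eval_fn):
    """Mean accuracy over predefined folds (e.g., the 5 folds of ESC-50/ESC-10).

    folds: array assigning each clip to a fold id; each fold is held out once.
    """
    accs = []
    for held_out in np.unique(folds):
        train, test = folds != held_out, folds == held_out
        accs.append(train_eval_fn(features[train], labels[train],
                                  features[test], labels[test]))
    return float(np.mean(accs))
```
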
Hardware Specification: No
"Optimization typically took 1 day on a GPU. ... We are grateful for the GPUs donated by NVidia." The GPU model and count are not specified.

Software Dependencies: No
"Our approach is implemented in Torch7. ... We use the Adam [16] optimizer..." Beyond naming Torch7, the paper does not list dependency versions.

Experiment Setup: Yes
"Our approach is implemented in Torch7. We use the Adam [16] optimizer and a fixed learning rate of 0.001 and momentum term of 0.9 throughout our experiments. We experimented with several batch sizes, and found 64 to produce good results. We initialized all the weights to zero mean Gaussian noise with a standard deviation of 0.01. After every convolution, we use batch normalization [15] and rectified linear activation units [17]. We train the network for 100,000 iterations."

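As a reading aid, here is a minimal sketch of the quoted training configuration, assuming a PyTorch-style reimplementation (the paper used Torch7). Channel widths, kernel sizes, and strides below are placeholders; only the initialization, the conv -> batch norm -> ReLU ordering, and the optimizer settings come from the quote:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    """One block in the stated style: convolution -> batch norm -> ReLU."""
    conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2)
    nn.init.normal_(conv.weight, mean=0.0, std=0.01)  # zero-mean Gaussian, std 0.01
    nn.init.zeros_(conv.bias)
    return nn.Sequential(conv, nn.BatchNorm1d(out_ch), nn.ReLU(inplace=True))

# Placeholder stack; the real SoundNet is deeper and operates on raw waveforms.
model = nn.Sequential(conv_block(1, 16, 64, 2), conv_block(16, 32, 32, 2))

# Adam with the stated learning rate; the paper's "momentum term of 0.9"
# corresponds to Adam's beta1 parameter.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
batch_size = 64           # reported to produce good results
num_iterations = 100_000  # total training iterations
```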