SoundNet: Learning Sound Representations from Unlabeled Video
Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels. ... We present a deep convolutional network that learns directly on raw audio waveforms, which is trained by transferring knowledge from vision into sound. ... In our experiments, we show that the representation learned by our network obtains state-of-the-art accuracy on three standard acoustic scene classification datasets. ... We evaluate the SoundNet representation for acoustic scene classification. The aim in this task is to categorize sound clips into one of the many acoustic scene categories. We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]. |
| Researcher Affiliation | Academia | Yusuf Aytar MIT yusuf@csail.mit.edu Carl Vondrick MIT vondrick@mit.edu Antonio Torralba MIT torralba@mit.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code, data, and models will be released. |
| Open Datasets | Yes | We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]. |
| Dataset Splits | Yes | We split the unlabeled video dataset into a training set and a held-out validation set. We use 2,000,000 videos for training, and the remaining 140,000 videos for validation. ... The data is prearranged into 5 folds and the accuracy results are reported as the mean of 5 leave-one-fold-out evaluations. *(A sketch of this fold protocol appears after the table.)* |
| Hardware Specification | No | Optimization typically took 1 day on a GPU. ... We are grateful for the GPUs donated by NVidia. |
| Software Dependencies | No | Our approach is implemented in Torch7. ... We use the Adam [16] optimizer... |
| Experiment Setup | Yes | Our approach is implemented in Torch7. We use the Adam [16] optimizer and a fixed learning rate of 0.001 and momentum term of 0.9 throughout our experiments. We experimented with several batch sizes, and found 64 to produce good results. We initialized all the weights to zero mean Gaussian noise with a standard deviation of 0.01. After every convolution, we use batch normalization [15] and rectified linear activation units [17]. We train the network for 100,000 iterations. *(A sketch of this configuration appears after the table.)* |
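
The Experiment Setup row fully specifies the optimization recipe, so a short sketch can make it concrete. Below is a minimal PyTorch approximation (the paper itself uses Torch7): the network depth and layer shapes are invented for illustration, and the real training objective, which transfers knowledge from vision networks into sound, is replaced with a placeholder loss.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    # "After every convolution, we use batch normalization
    # and rectified linear activation units."
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, stride=stride),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
    )

# Hypothetical two-block stack over raw waveforms (1 input channel);
# the paper's actual architecture is deeper.
model = nn.Sequential(
    conv_block(1, 16, kernel=64, stride=2),
    conv_block(16, 32, kernel=32, stride=2),
)

# "We initialized all the weights to zero mean Gaussian noise
# with a standard deviation of 0.01."
for m in model.modules():
    if isinstance(m, nn.Conv1d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

# Adam with a fixed learning rate of 0.001; the quoted "momentum term
# of 0.9" is read here as Adam's beta1 (an assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

batch_size, num_iterations = 64, 100_000  # as quoted in the table

# One illustrative step on random audio; the real loop would run for
# num_iterations and distill targets from pretrained vision networks.
waveforms = torch.randn(batch_size, 1, 22050)
loss = model(waveforms).pow(2).mean()  # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```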
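
The Dataset Splits row quotes a leave-one-fold-out protocol over 5 prearranged folds. The sketch below shows how such an evaluation is typically computed; the features, labels, and fold assignment are random placeholders, and LinearSVC stands in for the linear SVM the paper reports training on SoundNet features.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
features = rng.normal(size=(400, 256))  # placeholder clip features
labels = rng.integers(0, 10, size=400)  # placeholder class labels
folds = np.arange(400) % 5              # stand-in for prearranged fold ids 0..4

# Mean of 5 leave-one-fold-out evaluations, as quoted above.
accuracies = []
for held_out in range(5):
    train, test = folds != held_out, folds == held_out
    clf = LinearSVC().fit(features[train], labels[train])
    accuracies.append(clf.score(features[test], labels[test]))

print("mean accuracy over 5 folds:", np.mean(accuracies))
```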