Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SoundNet: Learning Sound Representations from Unlabeled Video
Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba
NeurIPS 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels. ... We present a deep convolutional network that learns directly on raw audio waveforms, which is trained by transferring knowledge from vision into sound. ... In our experiments, we show that the representation learned by our network obtains state-of-the-art accuracy on three standard acoustic scene classification datasets. ... We evaluate the SoundNet representation for acoustic scene classification. The aim in this task is to categorize sound clips into one of the many acoustic scene categories. We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]. |
| Researcher Affiliation | Academia | Yusuf Aytar, MIT; Carl Vondrick, MIT; Antonio Torralba, MIT |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code, data, and models will be released. |
| Open Datasets | Yes | We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]. |
| Dataset Splits | Yes | We split the unlabeled video dataset into a training set and a held-out validation set. We use 2,000,000 videos for training, and the remaining 140,000 videos for validation. ... The data is prearranged into 5 folds and the accuracy results are reported as the mean of 5 leave-one-fold-out evaluations. |
| Hardware Specification | No | Optimization typically took 1 day on a GPU. ... We are grateful for the GPUs donated by NVidia. |
| Software Dependencies | No | Our approach is implemented in Torch7. ... We use the Adam [16] optimizer... |
| Experiment Setup | Yes | Our approach is implemented in Torch7. We use the Adam [16] optimizer and a fixed learning rate of 0.001 and momentum term of 0.9 throughout our experiments. We experimented with several batch sizes, and found 64 to produce good results. We initialized all the weights to zero mean Gaussian noise with a standard deviation of 0.01. After every convolution, we use batch normalization [15] and rectified linear activation units [17]. We train the network for 100,000 iterations. |
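The experiment-setup hyperparameters quoted above can be sketched in code. The paper's implementation is in Torch7 and its exact architecture is not reproduced here; the following is a hypothetical PyTorch rendering in which `conv_block` and the two-layer stack are illustrative stand-ins, while the optimizer, initialization, and batch-size values come from the quoted text (the paper's "momentum term of 0.9" is read as Adam's beta1).

```python
# Hypothetical PyTorch sketch of the reported training setup.
# The original used Torch7; layer shapes here are illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    # "After every convolution, we use batch normalization and
    #  rectified linear activation units."
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, stride=stride),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
    )

def init_weights(m):
    # "We initialized all the weights to zero mean Gaussian noise
    #  with a standard deviation of 0.01."
    if isinstance(m, nn.Conv1d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Illustrative two-block stack; not the actual SoundNet architecture.
model = nn.Sequential(
    conv_block(1, 16, kernel=64, stride=2),
    conv_block(16, 32, kernel=32, stride=2),
)
model.apply(init_weights)

# "Adam optimizer and a fixed learning rate of 0.001 and momentum
#  term of 0.9" — interpreted as beta1 = 0.9.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

batch_size = 64            # "found 64 to produce good results"
num_iterations = 100_000   # "We train the network for 100,000 iterations."
```

Note that because the paper reports iterations rather than epochs, a faithful reproduction would loop over batches of the unlabeled-video training set until `num_iterations` optimizer steps have been taken.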