Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SoundNet: Learning Sound Representations from Unlabeled Video
Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba
NeurIPS 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels. ... We present a deep convolutional network that learns directly on raw audio waveforms, which is trained by transferring knowledge from vision into sound. ... In our experiments, we show that the representation learned by our network obtains state-of-the-art accuracy on three standard acoustic scene classification datasets. ... We evaluate the SoundNet representation for acoustic scene classification. The aim in this task is to categorize sound clips into one of the many acoustic scene categories. We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]. |
| Researcher Affiliation | Academia | Yusuf Aytar, MIT; Carl Vondrick, MIT; Antonio Torralba, MIT |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code, data, and models will be released. |
| Open Datasets | Yes | We use three standard, publicly available datasets: DCASE Challenge [34], ESC-50 [28], and ESC-10 [28]. |
| Dataset Splits | Yes | We split the unlabeled video dataset into a training set and a held-out validation set. We use 2,000,000 videos for training, and the remaining 140,000 videos for validation. ... The data is prearranged into 5 folds and the accuracy results are reported as the mean of 5 leave-one-fold-out evaluations. |
| Hardware Specification | No | Optimization typically took 1 day on a GPU. ... We are grateful for the GPUs donated by NVidia. |
| Software Dependencies | No | Our approach is implemented in Torch7. ... We use the Adam [16] optimizer... |
| Experiment Setup | Yes | Our approach is implemented in Torch7. We use the Adam [16] optimizer and a fixed learning rate of 0.001 and momentum term of 0.9 throughout our experiments. We experimented with several batch sizes, and found 64 to produce good results. We initialized all the weights to zero mean Gaussian noise with a standard deviation of 0.01. After every convolution, we use batch normalization [15] and rectified linear activation units [17]. We train the network for 100,000 iterations. |
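The experiment-setup hyperparameters quoted above can be sketched in code. The paper's implementation is in Torch7 and its exact architecture is not reproduced here; the following is a hypothetical PyTorch rendering in which `conv_block` and the two-layer stack are illustrative stand-ins, while the optimizer, initialization, and batch-size values come from the quoted text (the paper's "momentum term of 0.9" is read as Adam's beta1).

```python
# Hypothetical PyTorch sketch of the reported training setup.
# The original used Torch7; layer shapes here are illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    # "After every convolution, we use batch normalization and
    #  rectified linear activation units."
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, stride=stride),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
    )

def init_weights(m):
    # "We initialized all the weights to zero mean Gaussian noise
    #  with a standard deviation of 0.01."
    if isinstance(m, nn.Conv1d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Illustrative two-block stack; not the actual SoundNet architecture.
model = nn.Sequential(
    conv_block(1, 16, kernel=64, stride=2),
    conv_block(16, 32, kernel=32, stride=2),
)
model.apply(init_weights)

# "Adam optimizer and a fixed learning rate of 0.001 and momentum
#  term of 0.9" — interpreted as beta1 = 0.9.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

batch_size = 64            # "found 64 to produce good results"
num_iterations = 100_000   # "We train the network for 100,000 iterations."
```

Note that because the paper reports iterations rather than epochs, a faithful reproduction would loop over batches of the unlabeled-video training set until `num_iterations` optimizer steps have been taken.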