Learning Sound Events from Webly Labeled Data
Authors: Anurag Kumar, Ankit Shah, Alexander Hauptmann, Bhiksha Raj
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our proposed system, WeblyNet, two deep neural networks co-teach each other to robustly learn from webly labeled data, leading to around 17% relative improvement over the baseline method. (A hedged sketch of such joint training appears below the table.) |
| Researcher Affiliation | Academia | Anurag Kumar, Ankit Shah, Alexander Hauptmann and Bhiksha Raj, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. argxkr@gmail.com, {aps1, alex, bhiksha}@cs.cmu.edu |
| Pseudocode | Yes | Algorithm 1 outlines this procedure. |
| Open Source Code | Yes | Please visit https://github.com/anuragkr90/webly-labeled-sounds for webly labeled data, codebase and additional analysis. |
| Open Datasets | Yes | We formed two datasets using the above strategy. The first one, referred to as Webly-2k, uses the top 50 retrieved videos for each class and has around 1,900 audio recordings. The second one, Webly-4k, uses the top 100 retrieved videos for each class and contains around 3,800 recordings. Please visit https://github.com/anuragkr90/webly-labeled-sounds for webly labeled data, codebase and additional analysis. |
| Dataset Splits | Yes | A subset of recordings from the Unbalanced set of Audio Set is used for validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions "All experiments are done in PyTorch toolkit" but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | N1 is trained on the first set of audio representations (X1). It is a deep CNN. The layer blocks B1 to B4 each consist of two convolutional layers followed by a max-pooling layer. The number of filters in both convolutional layers of these blocks is {B1: 64, B2: 128, B3: 256, B4: 256}. The convolutional filters are of size 3×3 in all cases, and convolution is done with a stride of 1. Padding of 1 is applied to the inputs of all convolutional layers. Max-pooling in these blocks uses a window of size 1×2, moving by the same amount. Layers F1 and F2 are again convolutional layers, with 1024 filters of size 1×8 and 1024 filters of size 1×1, respectively. All convolutional layers from B1 to F2 include batch normalization [Ioffe and Szegedy, 2015] and ReLU (max(0, x)) activations. The network N2 (with X2 as inputs) consists of 3 fully connected hidden layers with 2048, 1024 and 1024 neurons, respectively. The output layer contains C neurons. A dropout of 0.4 is applied after the first and second hidden layers. ReLU activation is used in all hidden layers and sigmoid in the output layer. The network is trained through Adam optimization [Kingma and Ba, 2014]. Hyperparameters are tuned using the validation set. (See the architecture sketch below the table.) |
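To make the setup concrete, here is a minimal PyTorch sketch of the two networks as described in the excerpt above. The input channel count, the tensor layout of X1, the feature dimensionality of X2, and the output head of N1 are not specified in the excerpt and are assumptions; the filter counts, kernel sizes, pooling, dropout, and activations follow the quoted description.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3 conv layers (stride 1, padding 1), each with batch norm
    and ReLU, followed by 1x2 max pooling, per the B1-B4 description."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2)),
    )


class N1(nn.Module):
    """CNN on the first audio representation X1: blocks B1-B4 with
    {64, 128, 256, 256} filters, then F1 (1024 filters, 1x8) and
    F2 (1024 filters, 1x1), all with batch norm and ReLU."""

    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64),     # B1 (single input channel assumed)
            conv_block(64, 128),   # B2
            conv_block(128, 256),  # B3
            conv_block(256, 256),  # B4
            nn.Conv2d(256, 1024, kernel_size=(1, 8)),  # F1
            nn.BatchNorm2d(1024),
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=1),      # F2
            nn.BatchNorm2d(1024),
            nn.ReLU(inplace=True),
        )
        # Output head: not described in the excerpt; a 1x1 conv to C
        # classes followed by global average pooling is an assumption.
        self.classifier = nn.Conv2d(1024, num_classes, kernel_size=1)

    def forward(self, x):  # x: (batch, 1, time, freq) -- assumed layout
        h = self.features(x)
        h = torch.sigmoid(self.classifier(h))
        return h.mean(dim=(2, 3))  # pool over the remaining grid


class N2(nn.Module):
    """MLP on the second representation X2: hidden layers of 2048, 1024,
    1024 units, dropout 0.4 after the first two, ReLU hidden activations,
    sigmoid outputs over C classes."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(inplace=True), nn.Dropout(0.4),
            nn.Linear(2048, 1024), nn.ReLU(inplace=True), nn.Dropout(0.4),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```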
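The excerpts say only that the two networks "co-teach each other"; one common way to realize such joint training is to fit both networks to the noisy web labels while adding a divergence penalty that pulls their posteriors toward agreement. The sketch below uses a symmetric MSE between posteriors and a hypothetical weight `alpha`; the paper's exact coupling term may differ.

```python
import torch
import torch.nn.functional as F


def joint_step(n1, n2, optimizer, x1, x2, y, alpha=0.1):
    """One joint training step: each network fits the (noisy) web labels
    with binary cross-entropy, plus an agreement term between the two
    networks' outputs. The MSE agreement term and the weight alpha are
    assumptions, not necessarily the paper's objective."""
    p1, p2 = n1(x1), n2(x2)
    loss = (F.binary_cross_entropy(p1, y)
            + F.binary_cross_entropy(p2, y)
            + alpha * F.mse_loss(p1, p2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch: C, D2 (feature dim of X2) and the loader are placeholders.
# n1, n2 = N1(num_classes=C), N2(in_dim=D2, num_classes=C)
# optimizer = torch.optim.Adam(list(n1.parameters()) + list(n2.parameters()))
# for x1, x2, y in loader:
#     joint_step(n1, n2, optimizer, x1, x2, y)
```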