Can gradient clipping mitigate label noise?
Authors: Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experimental Illustration): "We now present experiments illustrating that: (a) we may exhibit label noise scenarios that defeat a Huberised but not partially Huberised loss, confirming Propositions 4 and 7, and (b) partially Huberised versions of existing losses perform well on real-world datasets subject to label noise." |
| Researcher Affiliation | Industry | Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar Google Research New York, NY USA {adityakmenon,ankitsrawat,sashank,sanjivk}@google.com |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology is available or provide a link to it. |
| Open Datasets | Yes | We now demonstrate that partially Huberised losses perform well with deep neural networks trained on MNIST, CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | Yes | We pick τ ∈ {2, 10} (equivalently corresponding to probability thresholds 0.5 and 0.1, respectively) so as to maximize accuracy on a validation set of noisy samples with the maximal noise rate ρ = 0.6; the chosen value of τ was then used for each noise level. (See the loss sketch after the table for how τ sets the probability threshold 1/τ.) |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For MNIST, we train a LeNet (LeCun et al., 1998) using Adam with batch size N = 32 and weight decay of 10⁻³. For CIFAR-10 and CIFAR-100, we train a ResNet-50 (He et al., 2016) using SGD with momentum 0.1, weight decay of 5 × 10⁻³, batch normalisation, and N = 64 and 128 respectively. For each dataset, we pick τ ∈ {2, 10} (equivalently corresponding to probability thresholds 0.5 and 0.1, respectively) so as to maximize accuracy on a validation set of noisy samples with the maximal noise rate ρ = 0.6; the chosen value of τ was then used for each noise level. (A minimal sketch of this training configuration follows the table.) |
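
For context on the method under reproduction: the paper's partially Huberised cross-entropy linearises the log-loss once the true-class probability p falls below 1/τ, which caps the gradient contribution of (potentially mislabelled) low-probability examples at τ. Below is a minimal PyTorch sketch of that loss. It follows the piecewise form implied by the quoted thresholds (τ ∈ {2, 10} ↔ p ≤ 0.5, 0.1), but since no code was released, the function name and implementation details here are our own assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def phuber_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                         tau: float = 10.0) -> torch.Tensor:
    """Partially Huberised cross-entropy (sketch).

    For true-class probability p:
        -tau * p + log(tau) + 1   if p <= 1/tau   (linearised branch)
        -log(p)                   otherwise        (ordinary log-loss)
    The branches meet at p = 1/tau, where -log(p) has slope exactly -tau,
    so the loss is continuous with gradients bounded by tau.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_p = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_y per example
    p = log_p.exp()
    linear = -tau * p + math.log(tau) + 1.0
    return torch.where(p <= 1.0 / tau, linear, -log_p).mean()

# Tiny usage example with random data.
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = phuber_cross_entropy(logits, targets, tau=2.0)
loss.backward()
```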
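
And a sketch of the quoted CIFAR training configuration. The momentum, weight decay, and batch size come straight from the excerpt above; the learning rate is not quoted anywhere in this table, so the value below is a placeholder assumption.

```python
import torch
from torchvision.models import resnet50

# ResNet-50 for CIFAR-10, per the quoted setup (He et al., 2016).
model = resnet50(num_classes=10)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # ASSUMPTION: learning rate is not given in the excerpt
    momentum=0.1,       # as quoted
    weight_decay=5e-3,  # as quoted: 5 x 10^-3
)
batch_size = 64         # CIFAR-10 (128 for CIFAR-100), as quoted
```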