Cross-Entropy Loss Functions: Theoretical Analysis and Applications
Authors: Anqi Mao, Mehryar Mohri, Yutao Zhong
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While our main purpose is a theoretical analysis, we also present an extensive empirical analysis comparing comp-sum losses. We further report the results of a series of experiments demonstrating that our adversarial robustness algorithms outperform the current state-of-the-art, while also achieving a superior non-adversarial accuracy. |
| Researcher Affiliation | Collaboration | 1Courant Institute of Mathematical Sciences, New York, NY; 2Google Research, New York, NY. Correspondence to: Anqi Mao <aqmao@cims.nyu.edu>, Mehryar Mohri <mohri@google.com>, Yutao Zhong <yutao@cims.nyu.edu>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the methodology is openly available. |
| Open Datasets | Yes | We further report the results of experiments with the CIFAR-10, CIFAR-100 and SVHN datasets... Krizhevsky, 2009; Netzer et al., 2011 |
| Dataset Splits | Yes | We used early stopping on a held-out validation set of 1,024 samples by evaluating its robust accuracy throughout training with 40-step PGD on the margin loss, denoted by PGD40 margin, and selecting the best checkpoint (Rice et al., 2020). (A hedged PGD40-margin sketch follows the table.) |
| Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments. |
| Software Dependencies | No | The paper mentions general algorithms and architectures (e.g., SGD, Nesterov momentum, ResNet), but does not specify software library names with version numbers required for replication (e.g., PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | All models were trained via Stochastic Gradient Descent (SGD) with Nesterov momentum (Nesterov, 1983), batch size 1,024 and weight decay 1 × 10⁻⁴. We used ResNet-34 and trained for 200 epochs using the cosine decay learning rate schedule (Loshchilov & Hutter, 2016) without restarts. The initial learning rate was selected from {0.01, 0.1, 1.0}... For TRADES, we adopted exactly the same setup as Gowal et al. (2020). For our smooth adversarial comp-sum losses, we set both ρ and ν to 1 by default... We trained for 400 epochs using the cosine decay learning rate schedule (Loshchilov & Hutter, 2016) without restarts. The initial learning rate is set to 0.4. We used model weight averaging (Izmailov et al., 2018) with decay rate 0.9975. (A hedged sketch of this setup follows the table.) |
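Because the paper releases no code and names no library versions (see the Open Source Code and Software Dependencies rows), the following is a minimal sketch of the reported optimization setup. PyTorch, torchvision's `resnet34`, and the momentum coefficient 0.9 are assumptions; only the quoted hyperparameters (Nesterov SGD, batch size 1,024, weight decay 1 × 10⁻⁴, ResNet-34, cosine decay without restarts, EMA decay 0.9975, and the stated learning rates and epoch counts) come from the paper.

```python
# Minimal sketch of the reported training configuration (assumptions noted inline).
import torch
from torchvision.models import resnet34

model = resnet34(num_classes=10)      # CIFAR-10; use 100 for CIFAR-100

epochs = 200                          # 400 for the weight-averaging runs
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                           # selected from {0.01, 0.1, 1.0}; 0.4 for the 400-epoch runs
    momentum=0.9,                     # assumed value; the paper states only "Nesterov momentum"
    nesterov=True,
    weight_decay=1e-4,
)
# Cosine decay over all epochs, no restarts (Loshchilov & Hutter, 2016).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# Model weight averaging (Izmailov et al., 2018) implemented as an EMA with decay 0.9975.
ema_model = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, p, num: 0.9975 * avg + (1 - 0.9975) * p
)

# Per-epoch skeleton (batch size 1,024 is set in the DataLoader, not shown):
# for epoch in range(epochs):
#     for x, y in train_loader:
#         optimizer.zero_grad()
#         loss = criterion(model(x), y)   # a comp-sum / smooth adversarial loss from the paper
#         loss.backward()
#         optimizer.step()
#         ema_model.update_parameters(model)
#     scheduler.step()
```

The `criterion` placeholder stands for the paper's comp-sum or smooth adversarial comp-sum losses (with ρ = ν = 1 by default), which are defined in the paper and not reproduced here.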
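The checkpoint-selection protocol in the Dataset Splits row ("40-step PGD on the margin loss") can likewise be sketched only under stated assumptions: PyTorch, inputs scaled to [0, 1], and an L∞ budget of 8/255 with step size 2/255, none of which are specified in the quoted text.

```python
# Sketch of robust-accuracy evaluation under PGD-40 on the multiclass margin loss
# (PGD40 margin). eps, alpha, and the PyTorch implementation are assumptions.
import torch


def margin_loss(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Negative margin max_{y' != y} f(x)_{y'} - f(x)_y, averaged over the batch."""
    correct = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    # Mask out the true class before maximizing over the incorrect classes.
    masked = logits.masked_fill(
        torch.nn.functional.one_hot(y, logits.size(1)).bool(), float("-inf")
    )
    return (masked.max(dim=1).values - correct).mean()


def pgd_margin_robust_accuracy(model, loader, eps=8 / 255, alpha=2 / 255, steps=40):
    """Robust accuracy of `model` under a 40-step PGD attack on the margin loss."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        # Random start inside the L-infinity ball, then projected gradient ascent.
        delta = torch.empty_like(x).uniform_(-eps, eps)
        for _ in range(steps):
            delta.requires_grad_(True)
            loss = margin_loss(model(torch.clamp(x + delta, 0, 1)), y)
            grad = torch.autograd.grad(loss, delta)[0]
            delta = torch.clamp(delta.detach() + alpha * grad.sign(), -eps, eps)
        with torch.no_grad():
            preds = model(torch.clamp(x + delta, 0, 1)).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```

In the paper's protocol, this metric would be computed on the held-out 1,024-sample validation split throughout training, and the checkpoint with the highest robust accuracy kept (Rice et al., 2020).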