Cross-Entropy Loss Functions: Theoretical Analysis and Applications

Authors: Anqi Mao, Mehryar Mohri, Yutao Zhong

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | While our main purpose is a theoretical analysis, we also present an extensive empirical analysis comparing comp-sum losses. We further report the results of a series of experiments demonstrating that our adversarial robustness algorithms outperform the current state-of-the-art, while also achieving a superior non-adversarial accuracy.
Researcher Affiliation | Collaboration | (1) Courant Institute of Mathematical Sciences, New York, NY; (2) Google Research, New York, NY. Correspondence to: Anqi Mao <aqmao@cims.nyu.edu>, Mehryar Mohri <mohri@google.com>, Yutao Zhong <yutao@cims.nyu.edu>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that source code for the methodology is openly available.
Open Datasets | Yes | We further report the results of experiments with the CIFAR-10, CIFAR-100 and SVHN datasets... (Krizhevsky, 2009; Netzer et al., 2011)
Dataset Splits | Yes | We used early stopping on a held-out validation set of 1,024 samples by evaluating its robust accuracy throughout training with 40-step PGD on the margin loss, denoted by PGD40 margin, and selecting the best checkpoint (Rice et al., 2020).
Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | The paper mentions general algorithms and architectures (e.g., SGD, Nesterov momentum, ResNet), but does not specify software library names with version numbers required for replication (e.g., PyTorch, TensorFlow versions).
Experiment Setup | Yes | All models were trained via Stochastic Gradient Descent (SGD) with Nesterov momentum (Nesterov, 1983), batch size 1,024 and weight decay 1 × 10^-4. We used ResNet-34 and trained for 200 epochs using the cosine decay learning rate schedule (Loshchilov & Hutter, 2016) without restarts. The initial learning rate was selected from {0.01, 0.1, 1.0}... For TRADES, we adopted exactly the same setup as Gowal et al. (2020). For our smooth adversarial comp-sum losses, we set both ρ and ν to 1 by default... We trained for 400 epochs using the cosine decay learning rate schedule (Loshchilov & Hutter, 2016) without restarts. The initial learning rate is set to 0.4. We used model weight averaging (Izmailov et al., 2018) with decay rate 0.9975.
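
The Open Datasets and Dataset Splits rows name the data sources (CIFAR-10, CIFAR-100, SVHN) and a held-out validation set of 1,024 samples, but the paper does not state its software stack or how the held-out samples were chosen. A minimal sketch of one way to reproduce such a split, assuming PyTorch/torchvision and a seeded random split (both assumptions, not statements from the paper):

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Plain tensor conversion only; the paper's augmentation pipeline is not described in the excerpts above.
transform = transforms.ToTensor()

# CIFAR-10 shown; datasets.CIFAR100 is analogous, and datasets.SVHN uses split="train"/"test" instead of train=True/False.
train_full = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Hold out 1,024 samples for validation-based early stopping, as reported in the paper.
# Whether the paper uses a random or a fixed subset is not stated; a seeded random split is assumed here.
val_size = 1024
train_set, val_set = random_split(
    train_full,
    [len(train_full) - val_size, val_size],
    generator=torch.Generator().manual_seed(0),
)

train_loader = DataLoader(train_set, batch_size=1024, shuffle=True)  # batch size 1,024 per the paper
val_loader = DataLoader(val_set, batch_size=256, shuffle=False)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)
```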
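The Dataset Splits row also refers to evaluating robust accuracy with 40-step PGD on the margin loss (PGD40 margin) to select the best checkpoint. The sketch below shows one way such an evaluation could look, again assuming PyTorch; the L-infinity radius 8/255 and step size 2/255 are common defaults and are assumptions, since the excerpt does not state the attack hyperparameters.

```python
import torch

def margin_loss(logits, y):
    """Negative multiclass margin: max_{y' != y} f_{y'}(x) - f_y(x).
    Maximizing this quantity pushes the input toward misclassification."""
    correct = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, y.unsqueeze(1), float("-inf"))  # exclude the true class from the max
    return masked.max(dim=1).values - correct

def pgd_margin(model, x, y, eps=8/255, alpha=2/255, steps=40):
    """40-step L-infinity PGD ascent on the margin loss (eps and alpha are assumed values)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = margin_loss(model(x_adv), y).sum()
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project back into the eps-ball
    return x_adv.detach()

def robust_accuracy(model, loader, device="cuda"):
    """Robust accuracy on the held-out set; the checkpoint maximizing this is kept."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_margin(model, x, y)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```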
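The Experiment Setup row fixes the optimizer, learning rate schedule, weight decay, and weight averaging, but not the framework, the momentum coefficient, or the loss implementation. A sketch of how those hyperparameters might be wired together, assuming PyTorch and torchvision's resnet34 as a stand-in for the paper's ResNet-34; cross-entropy appears only as a placeholder for the comp-sum and smooth adversarial losses studied in the paper.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.optim.swa_utils import AveragedModel
from torchvision.models import resnet34

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet34(num_classes=10).to(device)  # CIFAR-10; the paper's exact ResNet-34 variant is unspecified

epochs = 200   # first setup quoted above; the second setup uses 400 epochs
lr = 0.1       # selected from {0.01, 0.1, 1.0} in the paper; 0.1 is shown as a placeholder
optimizer = SGD(model.parameters(), lr=lr, momentum=0.9,  # momentum coefficient not stated; 0.9 assumed
                nesterov=True, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)    # cosine decay without restarts

# Exponential moving average of the weights with decay 0.9975 (second setup quoted above).
ema_model = AveragedModel(model, avg_fn=lambda avg, p, n: 0.9975 * avg + (1 - 0.9975) * p)

criterion = torch.nn.CrossEntropyLoss()  # placeholder loss

for epoch in range(epochs):
    model.train()
    for x, y in train_loader:  # train_loader from the split sketch above
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
        ema_model.update_parameters(model)
    scheduler.step()
```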