Decoupled Kullback-Leibler Divergence Loss

Authors: Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu, Hanwang Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard RobustBench and competitive performance on knowledge distillation, demonstrating the substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
Researcher Affiliation | Collaboration | Jiequan Cui (Nanyang Technological University), Zhuotao Tian (HIT(SZ)), Zhisheng Zhong (The Chinese University of Hong Kong), Xiaojuan Qi (The University of Hong Kong), Bei Yu (The Chinese University of Hong Kong), Hanwang Zhang (Nanyang Technological University)
Pseudocode | Yes | Algorithm 1: Pseudo code for DKL/IKL loss in PyTorch style. Algorithm 2: Memory-efficient implementation for wMSE loss in PyTorch style. (An illustrative KL-loss sketch follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/jiequancui/DKL.
Open Datasets | Yes | We evaluate its effectiveness by conducting experiments on CIFAR-10/100 and ImageNet datasets... All the datasets we considered are publicly available, we list their licenses and URLs as follows: CIFAR-10 [41]: MIT License, https://www.cs.toronto.edu/~kriz/cifar.html. CIFAR-100 [41]: MIT License, https://www.cs.toronto.edu/~kriz/cifar.html. ImageNet [54]: Non-commercial, http://image-net.org. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | Table 4: Top-1 accuracy (%) on the ImageNet validation set and training speed (sec/iteration) comparisons.
Hardware Specification | Yes | Models are trained with 4 NVIDIA GeForce 3090 GPUs.
Software Dependencies | No | The paper mentions "PyTorch style" in its pseudocode but does not specify version numbers for Python, PyTorch, CUDA, or other key libraries used in the experiments.
Experiment Setup | Yes | We use an improved version of TRADES [71] as our baseline, which incorporates AWP [66] and adopts an increasing epsilon schedule. An SGD optimizer with a momentum of 0.9 is used. We use the cosine learning rate strategy with an initial learning rate of 0.2 and train models for 200 epochs. The batch size is 128, the weight decay is 5e-4, and the perturbation size ϵ is set to 8/255. (A hyper-parameter sketch follows the table.)
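As a companion to the Pseudocode row, the following is a minimal, illustrative sketch of the plain temperature-scaled KL divergence loss in PyTorch, i.e., the baseline objective that DKL/IKL decouples and re-weights. It is not the paper's Algorithm 1; the function name, tensor names, and the temperature value are assumptions made for this example.

```python
# Illustrative sketch only, not the paper's Algorithm 1: a standard
# temperature-scaled KL divergence loss as commonly used for knowledge
# distillation, written in PyTorch style.
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         T: float = 4.0) -> torch.Tensor:
    """KL(teacher || student) with temperature scaling, averaged over the batch."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # 'batchmean' matches the mathematical definition of KL divergence;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Example usage with random logits (batch of 128, 100 classes):
loss = kl_distillation_loss(torch.randn(128, 100), torch.randn(128, 100))
```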
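For reference against the Open Datasets row, here is a small sketch of how the listed CIFAR datasets can be loaded with torchvision. The paper does not specify its data pipeline, so the root directory and transform below are assumptions.

```python
# Hedged sketch: loading the public CIFAR datasets with torchvision.
# The root directory and transform are placeholders, not the paper's pipeline.
from torchvision import datasets, transforms

transform = transforms.ToTensor()
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
# ImageNet is distributed under a non-commercial license and must be downloaded
# manually from http://image-net.org before torchvision.datasets.ImageNet can read it.
```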
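The Experiment Setup row lists concrete optimization hyper-parameters; below is a minimal PyTorch sketch that wires them together (SGD with momentum, cosine learning-rate schedule, weight decay). The placeholder network and the omitted TRADES/AWP adversarial training loop are assumptions, not the authors' implementation.

```python
# Minimal sketch of the reported training hyper-parameters; the network is a
# placeholder and the TRADES/AWP adversarial training loop itself is omitted.
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 100)  # stand-in for the actual architecture (assumed)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.2,             # reported initial learning rate
                            momentum=0.9,       # reported momentum
                            weight_decay=5e-4)  # reported weight decay

epochs = 200                                    # reported training length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

batch_size = 128   # reported batch size
epsilon = 8 / 255  # reported L-infinity perturbation budget for adversarial examples
```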