Designing Robust Transformers using Robust Kernel Density Estimation

Authors: Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, Nhat Ho

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide empirical validation of the benefits of integrating our proposed robust KDE attention mechanisms (Transformer-RKDE/SPKDE/MoM) into Transformer base models. We compare these with the standard softmax Transformer across multiple datasets representing different modalities. These include language modeling on the WikiText-103 dataset (Merity et al., 2016) (Section 4.1) and image classification on ImageNet (Russakovsky et al., 2015; Deng et al., 2009).
Researcher Affiliation | Academia | Xing Han, Department of ECE, University of Texas at Austin (aaronhan223@utexas.edu); Tongzheng Ren, Department of Computer Science, University of Texas at Austin (tongzheng@utexas.edu); Tan Minh Nguyen, Department of Mathematics, University of California, Los Angeles (tanmnguyen89@ucla.edu); Khai Nguyen, Department of Statistics and Data Sciences, University of Texas at Austin (khainb@utexas.edu); Joydeep Ghosh, Department of ECE, University of Texas at Austin (jghosh@utexas.edu); Nhat Ho, Department of Statistics and Data Sciences, University of Texas at Austin (minhnhat@utexas.edu)
Pseudocode | Yes | Algorithm 1: Procedure of Computing Attention Vector of Transformer-RKDE/SPKDE/MoM (a hedged sketch of a median-of-means style attention computation is given after the table).
Open Source Code | No | No explicit statement or link was found where the authors provide their own source code for the methodology described in this paper. The only code link found ('Implementation available at github.com/QData/TextAttack') refers to a third-party tool used for an attack method.
Open Datasets | Yes | These include language modeling on the WikiText-103 dataset (Merity et al., 2016) (Section 4.1) and image classification on ImageNet (Russakovsky et al., 2015; Deng et al., 2009). Furthermore, we assess performance across multiple robustness benchmarks, namely ImageNet-C (Hendrycks & Dietterich, 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-O (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019) (Section 4.2), as well as UEA time-series classification (Section 4.3).
Dataset Splits | Yes | Table 1 presents the validation and test perplexity (PPL) for several methods. The validation and test sets consist of 60 articles with 218K and 246K tokens, respectively.
Hardware Specification | Yes | All experiments were conducted on machines with 4 NVIDIA A-100 GPUs.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) were explicitly mentioned in the paper.
Experiment Setup | Yes | We configured the dimensions of key, value, and query to 128, and set the training and evaluation context length to 256. For self-attention, we allocated 8 heads for our methods and Performer, and 4 for Transformer-MGK. The dimension of the feedforward layer was set to 2048, with the number of layers established at 16. ... Each attack distorts the input image with a perturbation budget ϵ = 1/255 under the l∞ norm, while the PGD attack uses 20 steps with a step size of α = 0.15. (A hedged sketch of this PGD configuration follows below.)
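To make the pseudocode row concrete, here is a minimal sketch of a median-of-means (MoM) style attention computation. It assumes the MoM variant partitions key/value pairs into blocks, computes standard softmax attention within each block, and aggregates the per-block outputs with an element-wise median. The function name `mom_attention`, the block count, and the random partitioning are illustrative assumptions, not the authors' exact Algorithm 1 (which also covers the RKDE and SPKDE variants).

```python
import torch

def mom_attention(q, k, v, n_blocks=4):
    """Median-of-means style attention (illustrative sketch).

    q: (L_q, D) queries, k: (L_k, D) keys, v: (L_k, D_v) values.
    Key/value pairs are split into n_blocks groups; softmax attention is
    computed within each group and the element-wise median of the
    per-group outputs is returned as the robust aggregate.
    """
    d = q.shape[-1]
    # Shuffle so each block is a random subsample of key/value pairs.
    perm = torch.randperm(k.shape[0])
    k_blocks = k[perm].chunk(n_blocks, dim=0)
    v_blocks = v[perm].chunk(n_blocks, dim=0)

    block_outputs = []
    for kb, vb in zip(k_blocks, v_blocks):
        scores = q @ kb.transpose(0, 1) / d ** 0.5   # (L_q, L_k / n_blocks)
        weights = torch.softmax(scores, dim=-1)
        block_outputs.append(weights @ vb)            # (L_q, D_v)

    # Element-wise median across block estimates.
    return torch.stack(block_outputs, dim=0).median(dim=0).values

# Example with dimensions matching the reported setup (head dim 128, context 256).
q = torch.randn(256, 128)
k = torch.randn(256, 128)
v = torch.randn(256, 128)
out = mom_attention(q, k, v)   # (256, 128)
```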
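For the adversarial evaluation, the reported configuration (perturbation budget ϵ = 1/255 under the l∞ norm, 20 PGD steps, step size α = 0.15) corresponds to a standard projected gradient descent attack. The sketch below is a generic l∞ PGD loop using those numbers; the `model`, `images`, and `labels` arguments and the cross-entropy loss are assumptions, not the paper's exact attack code.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, images, labels, eps=1/255, alpha=0.15, steps=20):
    """l_inf PGD attack with the reported budget, step size, and step count.

    Illustrative sketch: maximizes cross-entropy loss by signed-gradient
    ascent, projecting back into the eps-ball around the clean images and
    clipping to the valid pixel range after every step.
    """
    x_adv = images.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), labels)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss along the gradient sign ...
        x_adv = x_adv.detach() + alpha * grad.sign()
        # ... then project into the eps-ball and valid pixel range.
        x_adv = torch.min(torch.max(x_adv, images - eps), images + eps)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```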