A Primal-Dual Framework for Transformers and Neural Networks

Authors: Tan Minh Nguyen, Tam Minh Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard Baraniuk, Stanley Osher

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
Researcher Affiliation | Academia | Tan M. Nguyen*, Department of Mathematics, University of California, Los Angeles (tanmnguyen89@ucla.edu); Tam Nguyen*, Department of ECE, Rice University (nguyenminhtam9520@gmail.com); Nhat Ho, Department of Statistics & Data Sciences, University of Texas at Austin (minhnhat@utexas.edu); Andrea L. Bertozzi, Department of Mathematics, University of California, Los Angeles (bertozzi@math.ucla.edu); Richard G. Baraniuk**, Department of ECE, Rice University (richb@rice.edu); Stanley J. Osher**, Department of Mathematics, University of California, Los Angeles (sjo@math.ucla.edu)
Pseudocode | No | No clearly labeled pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Implementation available at https://github.com/thuml/Flowformer.
Open Datasets | Yes | We empirically demonstrate the advantages of our Attention-BN, Attention-SH, and their combination (Attention-BN+SH) over the baseline softmax attention on the UEA time-series classification benchmark (Bagnall et al., 2018), the Long Range Arena benchmark (Tay et al., 2021), and the image classification task on the ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015).
Dataset Splits | Yes | The ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015) consists of 1.28M training images and 50K validation images.
Hardware Specification | Yes | All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper refers to third-party implementations and their respective GitHub repositories but does not explicitly list specific versions of the programming languages or software libraries used in the authors' own experimental setup.
Experiment Setup | Yes | In our experiments, we consider the constant β in Attention-BN/BN+SH and the different downsampling scales in Attention-SH/SH+BN as hyper-parameters to finetune. All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs. In all models, the number of heads is 8, whereas the model dimension and number of transformer layers are varied. For Attention-SH/SH+BN, we downsample keys and values by a factor of 2 after every two successive heads.
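
For illustration, below is a minimal PyTorch sketch of the key/value downsampling scheme quoted in the Experiment Setup row. It assumes average pooling along the sequence axis, a doubling of the downsampling scale after every two successive heads (one possible reading of the quoted setup), and standard scaled dot-product attention per head; the function name, argument names, and tensor layout are illustrative and are not taken from the authors' code.

    # Illustrative sketch only; not the authors' implementation.
    import torch
    import torch.nn.functional as F

    def scaled_head_attention(q, k, v, num_heads=8, base_factor=2, group_size=2):
        """q, k, v: tensors of shape (batch, seq_len, num_heads * head_dim).

        Heads are processed in groups of `group_size`; each later group sees
        keys and values downsampled by an extra factor of `base_factor`
        (assumed reading of "downsample keys and values by a factor of 2
        after every two successive heads").
        """
        batch, seq_len, dim = q.shape
        head_dim = dim // num_heads
        outputs = []
        for h in range(num_heads):
            cols = slice(h * head_dim, (h + 1) * head_dim)
            qh, kh, vh = q[..., cols], k[..., cols], v[..., cols]
            scale = base_factor ** (h // group_size)  # 1, 1, 2, 2, 4, 4, 8, 8 for 8 heads
            if scale > 1:
                # Average-pool keys and values over the sequence axis
                # (pooling choice is an assumption; the paper only states
                # that keys and values are downsampled).
                kh = F.avg_pool1d(kh.transpose(1, 2), scale).transpose(1, 2)
                vh = F.avg_pool1d(vh.transpose(1, 2), scale).transpose(1, 2)
            # Standard scaled dot-product attention for this head.
            attn = torch.softmax(qh @ kh.transpose(1, 2) / head_dim ** 0.5, dim=-1)
            outputs.append(attn @ vh)
        return torch.cat(outputs, dim=-1)

As a usage example, calling scaled_head_attention(x, x, x) with x = torch.randn(4, 64, 256) returns a (4, 64, 256) tensor, with the later head groups attending over progressively shorter key/value sequences.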