A Primal-Dual Framework for Transformers and Neural Networks
Authors: Tan Minh Nguyen, Tam Minh Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard Baraniuk, Stanley Osher
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification. |
| Researcher Affiliation | Academia | Tan M. Nguyen*, Department of Mathematics, University of California, Los Angeles (tanmnguyen89@ucla.edu); Tam Nguyen*, Department of ECE, Rice University (nguyenminhtam9520@gmail.com); Nhat Ho, Department of Statistics & Data Sciences, University of Texas at Austin (minhnhat@utexas.edu); Andrea L. Bertozzi, Department of Mathematics, University of California, Los Angeles (bertozzi@math.ucla.edu); Richard G. Baraniuk**, Department of ECE, Rice University (richb@rice.edu); Stanley J. Osher**, Department of Mathematics, University of California, Los Angeles (sjo@math.ucla.edu) |
| Pseudocode | No | No clearly labeled pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Implementation available at https://github.com/thuml/Flowformer. |
| Open Datasets | Yes | We empirically demonstrate the advantages of our Attention-BN, Attention-SH, and their combination (Attention-BN+SH) over the baseline softmax attention on the UEA time-series classification benchmark (Bagnall et al., 2018), the Long Range Arena benchmark (Tay et al., 2021), and the image classification task on the ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015). |
| Dataset Splits | Yes | The ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015) consists of 1.28M training images and 50K validation images. |
| Hardware Specification | Yes | All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper refers to third-party implementations and their respective GitHub repositories but does not explicitly list the specific versions of programming languages or software libraries used in their own experimental setup. |
| Experiment Setup | Yes | In our experiments, we consider the constant β in Attention-BN/BN+SH and the different downsampling scales in Attention-SH/SH+BN as hyper-parameters to finetune. All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs. In all models, the number of heads is 8, whereas the model dimension and number of transformer layers are varied. For Attention-SH/SH+BN, we downsample keys and values by the factor of 2, after every two successive heads. |
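
The experiment-setup quote above (8 heads, keys and values downsampled by a factor of 2 after every two successive heads) is concrete enough to sketch in code. The PyTorch snippet below is a minimal illustration only, assuming average pooling for the downsampling and assuming the factor-of-2 reduction compounds after every two heads; the class name `ScaledHeadAttentionSketch`, the tensor dimensions, and the grouping rule are illustrative assumptions, not taken from the authors' released code.

```python
# A minimal PyTorch sketch of multi-head attention with downsampled keys/values
# per head group, in the spirit of the paper's Attention-SH setup (8 heads;
# keys/values downsampled by a factor of 2 after every two successive heads).
# The grouping scheme, the use of average pooling, and all shapes here are
# assumptions for illustration, not the authors' reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledHeadAttentionSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, heads_per_group=2, scale_step=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.heads_per_group = heads_per_group
        self.scale_step = scale_step
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, H, N, head_dim)

        outputs = []
        for h in range(self.num_heads):
            # Assumed reading of "downsample by the factor of 2 after every
            # two successive heads": the factor doubles per group of two heads.
            factor = self.scale_step ** (h // self.heads_per_group)
            k_h, v_h = k[:, h], v[:, h]
            if factor > 1:
                # Average-pool keys/values along the sequence dimension.
                k_h = F.avg_pool1d(k_h.transpose(1, 2), factor).transpose(1, 2)
                v_h = F.avg_pool1d(v_h.transpose(1, 2), factor).transpose(1, 2)
            attn = (q[:, h] @ k_h.transpose(-2, -1)) / self.head_dim ** 0.5
            outputs.append(attn.softmax(dim=-1) @ v_h)

        out = torch.cat(outputs, dim=-1)  # (B, N, D)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 64, 256)
    print(ScaledHeadAttentionSketch()(x).shape)  # torch.Size([2, 64, 256])
```

Looping over heads keeps the differing key/value lengths explicit for readability; a practical implementation would batch the heads that share a downsampling factor. The Attention-BN constant β mentioned in the setup is a separate hyper-parameter and is not modeled in this sketch.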