Transformers from an Optimization Perspective

Authors: Yongyi Yang, Zengfeng Huang, David P. Wipf

NeurIPS 2022

Reproducibility Assessment

Research Type: Experimental
LLM Response: "To this end, we implement a Transformer model as described previously, up to known limitations like symmetric weights. We apply this model to two benchmarks, IMDB [30] and SST2 [40], which are both commonly-used sentiment classification datasets that rely on Glove-840b-300d [33] as the word embedding. Figures 5 and 6 display the energy of the output of each layer of a Transformer (as defined in (8)) averaged over 200 randomly chosen samples in the test set."

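The per-layer energy measurement quoted above reduces to a simple averaging loop. The sketch below is a hypothetical harness: `energy` is a placeholder standing in for the paper's energy function from Eq. (8), which is not restated in this report.

    def average_layer_energies(per_sample_outputs, energy):
        # per_sample_outputs: one list of per-layer output tensors for each
        # of the (here, 200) randomly chosen test samples; `energy` is a
        # stand-in for the paper's energy function defined in Eq. (8).
        num_layers = len(per_sample_outputs[0])
        totals = [0.0] * num_layers
        for outputs in per_sample_outputs:
            for i, h in enumerate(outputs):
                totals[i] += float(energy(h))
        return [t / len(per_sample_outputs) for t in totals]
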
Researcher Affiliation: Collaboration
LLM Response: Yongyi Yang, University of Michigan (yongyi@umich.edu); Zengfeng Huang, Fudan University (huangzf@fudan.edu.cn); David Wipf, Amazon Web Services (davidwipf@gmail.com). Work completed during an internship at the AWS Shanghai AI Lab.

Pseudocode: Yes
LLM Response: "Algorithm 1: For the t-th iteration, execute u(t) = y(t) - α1 f(y(t)); y(t+1) = u(t) - α2 g(u(t))."

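Read literally, Algorithm 1 is a two-block descent iteration with step sizes α1 and α2. Below is a minimal NumPy sketch under the assumption that f and g denote the update maps for the two energy terms (e.g., their gradients); the quadratic test functions at the end are hypothetical, chosen only so the script runs end to end.

    import numpy as np

    def algorithm_1(y0, f, g, alpha1, alpha2, num_steps):
        # u(t) = y(t) - alpha1 * f(y(t));  y(t+1) = u(t) - alpha2 * g(u(t))
        y = y0
        for _ in range(num_steps):
            u = y - alpha1 * f(y)
            y = u - alpha2 * g(u)
        return y

    # Hypothetical instantiation: f and g as gradients of two quadratics,
    # so the iteration descends their sum, 0.5*y'Ay + 0.5*||y - 1||^2.
    A = np.diag([2.0, 1.0])
    f = lambda y: A @ y        # gradient of 0.5 * y'Ay
    g = lambda u: u - 1.0      # gradient of 0.5 * ||u - 1||^2
    print(algorithm_1(np.zeros(2), f, g, 0.1, 0.1, 500))
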
Open Source Code: Yes
LLM Response: "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]"

Open Datasets: Yes
LLM Response: "We apply this model to two benchmarks, IMDB [30] and SST2 [40], which are both commonly-used sentiment classification datasets that rely on Glove-840b-300d [33] as the word embedding."

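Both benchmarks and the embeddings are publicly available. The paper does not specify its loading pipeline; a hypothetical one might use the Hugging Face datasets package and torchtext's GloVe wrapper:

    from datasets import load_dataset
    from torchtext.vocab import GloVe

    imdb = load_dataset("imdb")            # IMDB movie-review sentiment
    sst2 = load_dataset("glue", "sst2")    # SST2, distributed via GLUE
    glove = GloVe(name="840B", dim=300)    # the Glove-840b-300d vectors
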
Dataset Splits: No
LLM Response: The paper mentions using the IMDB and SST2 datasets and evaluating on "200 randomly chosen samples in the test set", but does not explicitly provide train/validation/test split percentages, sample counts, or instructions for reproducing the data partitioning.

Hardware Specification: No
LLM Response: Under its "Questions for Paper Analysis" section, the paper marks the information regarding the total amount of compute and the type of resources used as "[N/A]".

Software Dependencies: No
LLM Response: The paper does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions).

Experiment Setup: Yes
LLM Response: "Figure 5 uses randomly initialized weights while Figure 6 involves weights trained for 2000 steps with SGD and learning rate 0.01."
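
Only the optimizer (SGD), learning rate (0.01), and step count (2000) are reported, so the sketch below fills the remaining details (model size, pooling, batch construction) with hypothetical placeholders rather than the authors' actual configuration:

    import torch
    import torch.nn as nn

    # Stand-in model and data; hypothetical, since the paper only reports
    # the optimizer (SGD), learning rate (0.01), and step count (2000).
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=300, nhead=6, batch_first=True),
        num_layers=2,
    )
    head = nn.Linear(300, 2)  # binary sentiment head
    params = list(model.parameters()) + list(head.parameters())
    optimizer = torch.optim.SGD(params, lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(2000):
        x = torch.randn(32, 64, 300)          # placeholder GloVe embeddings
        y = torch.randint(0, 2, (32,))        # placeholder labels
        optimizer.zero_grad()
        logits = head(model(x).mean(dim=1))   # mean-pool tokens, classify
        loss = loss_fn(logits, y)
        loss.backward()
        optimizer.step()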