Transformers from an Optimization Perspective
Authors: Yongyi Yang, Zengfeng Huang, David P. Wipf
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we implement a Transformer model as described previously, up to known limitations like symmetric weights. We apply this model to two benchmarks, IMDB [30] and SST2 [40], which are both commonly-used sentiment classification datasets that rely on Glove-840b-300d [33] as the word embedding. Figures 5 and 6 display the energy of the output of each layer of a Transformer (as defined in (8)) averaged over 200 randomly chosen samples in the test set. |
| Researcher Affiliation | Collaboration | Yongyi Yang University of Michigan yongyi@umich.edu Zengfeng Huang Fudan University huangzf@fudan.edu.cn David Wipf Amazon Web Services davidwipf@gmail.com Work completed during an internship at the AWS Shanghai AI Lab. |
| Pseudocode | Yes | Algorithm 1 For the t-th iteration, execute u(t) = y(t) − α1∇f(y(t)); y(t+1) = u(t) − α2∇g(u(t)). |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | We apply this model to two benchmarks, IMDB [30] and SST2 [40], which are both commonly-used sentiment classification datasets that rely on Glove-840b-300d [33] as the word embedding. |
| Dataset Splits | No | The paper mentions using IMDB and SST2 datasets and evaluating on "200 randomly chosen samples in the test set", but does not explicitly provide specific training/validation/test split percentages, sample counts, or explicit instructions on how to reproduce the data partitioning. |
| Hardware Specification | No | The paper states under its 'Questions for Paper Analysis' section that information regarding the total amount of compute and type of resources used is '[N/A]'. |
| Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Figure 5 uses randomly initialized weights while Figure 6 involves weights trained for 2000 steps with SGD and learning rate 0.01. |
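The alternating update quoted from Algorithm 1 can be sketched as two interleaved gradient-descent steps. The snippet below is an illustration only, not the paper's implementation: the quadratic energies `f` and `g`, the targets `a` and `b`, and the step sizes are hypothetical placeholders standing in for the paper's energy terms.

```python
import numpy as np

# Hypothetical stand-ins for the paper's two energy terms:
#   f(y) = 0.5 * ||y - a||^2,  g(u) = 0.5 * ||u - b||^2
a = np.array([1.0, -2.0])
b = np.array([0.5, 0.5])

def grad_f(y):
    return y - a          # gradient of f

def grad_g(u):
    return u - b          # gradient of g

def total_energy(y):      # f(y) + g(y), tracked across iterations
    return 0.5 * np.sum((y - a) ** 2) + 0.5 * np.sum((y - b) ** 2)

alpha1, alpha2 = 0.1, 0.1
y = np.zeros(2)
energies = [total_energy(y)]
for t in range(200):
    u = y - alpha1 * grad_f(y)   # u(t)   = y(t) - α1 ∇f(y(t))
    y = u - alpha2 * grad_g(u)   # y(t+1) = u(t) - α2 ∇g(u(t))
    energies.append(total_energy(y))

# With small enough step sizes the tracked energy is non-increasing,
# mirroring the per-layer energy decay the report's figures describe.
assert all(e1 >= e2 - 1e-12 for e1, e2 in zip(energies, energies[1:]))
```

For these quadratic placeholders the iteration contracts to a fixed point near the minimizer of f + g, which is the behavior the paper's energy-versus-layer plots (Figures 5 and 6) are measuring for actual Transformer layers.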