On Separate Normalization in Self-supervised Transformers
Authors: Xiaohui Chen, Yinkai Wang, Yuanqi Du, Soha Hassoun, Li-Ping Liu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that the [CLS] embeddings learned with our separate normalization layer better encode the global contextual information and are distributed more uniformly in its anisotropic space. (A sketch of the separate-normalization idea appears below the table.) |
| Researcher Affiliation | Academia | Xiaohui Chen, Department of Computer Science, Tufts University, Medford, MA 02155, xiaohui.chen@tufts.edu; Yinkai Wang, Department of Computer Science, Tufts University, Medford, MA 02155, yinkai.wang@tufts.edu; Yuanqi Du, Department of Computer Science, Cornell University, Ithaca, NY 14850, yd392@cornell.edu; Soha Hassoun, Department of Computer Science, Tufts University, Medford, MA 02155, soha.hassoun@tufts.edu; Li-Ping Liu, Department of Computer Science, Tufts University, Medford, MA 02155, liping.liu@tufts.edu |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Datasets. We investigate the model performance on the four image datasets: STL10 [Coates et al., 2011], FGVC Aircraft [Maji et al., 2013], Street View House Numbers (SVHN) [Netzer et al., 2011], and Oxford 102 Flowers [Nilsback and Zisserman, 2008]. ... We conducted experiments using the ZINC dataset [Irwin and Shoichet, 2005], which contains approximately 250,000 molecular graphs. ... We also use the MolHIV dataset from the OGB [Hu et al., 2020a] collection... (A dataset-loading sketch appears below the table.) |
| Dataset Splits | No | The paper states, 'We follow the train/test split provided in the papers introducing the datasets.' However, it does not explicitly describe a separate validation split, its size, or the splitting methodology for all experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., BERT, RoBERTa, MAE, Graphormer) but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We follow the setup in He et al. [2022] to pretrain and evaluate the ViT. For pretraining, we train the ViT for 4000 epochs. For linear probing, we freeze the encoder's weights and train the last layer on the specific datasets for 2000 epochs. We use a batch size of 512 for pretraining and a batch size of 128 for linear probing. We choose λ ∈ {0, 0.01, 0.1, 1}. (A linear-probing sketch appears below the table.) |
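The "separate normalization layer" quoted in the Research Type row normalizes the [CLS] token apart from the remaining tokens. Below is a minimal PyTorch sketch of that idea; the module and variable names are hypothetical, and since the paper releases no code this illustrates the concept rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SeparateLayerNorm(nn.Module):
    """Applies one LayerNorm to the [CLS] token and a second, independent
    LayerNorm to the remaining tokens (hypothetical name; a sketch of the
    paper's separate-normalization idea)."""

    def __init__(self, dim: int):
        super().__init__()
        self.cls_norm = nn.LayerNorm(dim)  # statistics/params for [CLS] only
        self.tok_norm = nn.LayerNorm(dim)  # statistics/params for other tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), with position 0 assumed to be [CLS]
        cls_out = self.cls_norm(x[:, :1])
        tok_out = self.tok_norm(x[:, 1:])
        return torch.cat([cls_out, tok_out], dim=1)

# Drop-in replacement for a shared nn.LayerNorm in a transformer block
x = torch.randn(8, 197, 768)  # e.g., ViT-B/16: 1 [CLS] token + 196 patches
print(SeparateLayerNorm(768)(x).shape)  # torch.Size([8, 197, 768])
```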
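All datasets cited in the Open Datasets row are publicly downloadable. As a convenience, the sketch below pulls them with torchvision (versions ≥ 0.13 ship loaders for all four image datasets), PyTorch Geometric, and the ogb package; the root directory and transform are placeholder choices, not taken from the paper.

```python
from torchvision import datasets, transforms

root, tfm = "data", transforms.ToTensor()  # placeholder directory/transform

# Image datasets used for the ViT experiments
stl10 = datasets.STL10(root, split="train", download=True, transform=tfm)
aircraft = datasets.FGVCAircraft(root, split="train", download=True, transform=tfm)
svhn = datasets.SVHN(root, split="train", download=True, transform=tfm)
flowers = datasets.Flowers102(root, split="train", download=True, transform=tfm)

# Graph datasets (require the torch_geometric and ogb packages)
from torch_geometric.datasets import ZINC
from ogb.graphproppred import PygGraphPropPredDataset

zinc = ZINC(root, subset=False, split="train")  # ~250,000 molecular graphs
molhiv = PygGraphPropPredDataset(name="ogbg-molhiv", root=root)
```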
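The Experiment Setup row describes MAE-style pretraining followed by linear probing with a frozen encoder. The sketch below shows one probing step under those settings (frozen encoder, trainable last layer, batch size 128); the encoder is a stand-in module, since the paper's pretrained ViT weights and code are unavailable.

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained ViT encoder (hypothetical; any module
# mapping images to a feature vector fits this slot)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 768))
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # linear probing: the encoder stays frozen

head = nn.Linear(768, 10)  # trainable last layer (e.g., 10 STL10 classes)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# One step on a dummy STL10-sized batch; the paper trains the head for
# 2000 epochs with batch size 128
images = torch.randn(128, 3, 96, 96)
labels = torch.randint(0, 10, (128,))
with torch.no_grad():
    features = encoder(images)  # frozen features, no gradient tracking
loss = criterion(head(features), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```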