On Separate Normalization in Self-supervised Transformers

Authors: Xiaohui Chen, Yinkai Wang, Yuanqi Du, Soha Hassoun, Liping Liu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical study shows that the [CLS] embeddings learned with our separate normalization layer better encode the global contextual information and are distributed more uniformly in its anisotropic space. (An illustrative sketch of such a separate normalization layer appears after the table.)
Researcher Affiliation | Academia | Xiaohui Chen, Department of Computer Science, Tufts University, Medford, MA 02155, xiaohui.chen@tufts.edu; Yinkai Wang, Department of Computer Science, Tufts University, Medford, MA 02155, yinkai.wang@tufts.edu; Yuanqi Du, Department of Computer Science, Cornell University, Ithaca, NY 14850, yd392@cornell.edu; Soha Hassoun, Department of Computer Science, Tufts University, Medford, MA 02155, soha.hassoun@tufts.edu; Li-Ping Liu, Department of Computer Science, Tufts University, Medford, MA 02155, liping.liu@tufts.edu
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Datasets. We investigate the model performance on the four image datasets: STL10 [Coates et al., 2011], FGVC Aircraft [Maji et al., 2013], Street View House Numbers (SVHN) [Netzer et al., 2011], and Oxford 102 Flowers [Nilsback and Zisserman, 2008]. ... We conducted experiments using the ZINC dataset [Irwin and Shoichet, 2005], which contains approximately 250,000 molecular graphs. ... We also use the MolHIV dataset from the OGB [Hu et al., 2020a] collection...
Dataset Splits | No | The paper states, 'We follow the train/test split provided in the papers introducing the datasets.' However, it does not explicitly provide details for a separate validation split, its size, or the specific splitting methodology for all experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions various models and frameworks (e.g., BERT, RoBERTa, MAE, Graphormer) but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We follow the setup in He et al. [2022] to pretrain and evaluate the ViT. For pretraining, we train the ViT for 4000 epochs. For linear probing, we freeze the encoder's weights and train the last layer on the specific datasets for 2000 epochs. We use a batch size of 512 for pretraining and a batch size of 128 for linear probing. We choose λ ∈ {0, 0.01, 0.1, 1}. (These settings are restated as a config sketch after the table.)
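
The "Research Type" row quotes the paper's claim about a separate normalization layer for the [CLS] token. Purely as an illustration of what such a layer could look like, here is a minimal PyTorch sketch, assuming the [CLS] token sits at position 0 and that the split uses two independent nn.LayerNorm instances; the module name SeparateNorm and its placement inside the encoder are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SeparateNorm(nn.Module):
    """Sketch (assumption): normalize the [CLS] token and the remaining tokens
    with two independent LayerNorm modules instead of one shared layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.cls_norm = nn.LayerNorm(dim)    # used only for the [CLS] position
        self.token_norm = nn.LayerNorm(dim)  # used for all other positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, dim); the [CLS] token is at position 0.
        cls_tok, tokens = x[:, :1, :], x[:, 1:, :]
        return torch.cat([self.cls_norm(cls_tok), self.token_norm(tokens)], dim=1)


# Toy usage: batch of 2 sequences, 1 [CLS] token + 16 patch tokens, 64-dim embeddings.
out = SeparateNorm(64)(torch.randn(2, 17, 64))
print(out.shape)  # torch.Size([2, 17, 64])
```

With a single shared LayerNorm, the [CLS] position would be normalized with the same affine parameters as the patch tokens; splitting them lets the two embedding distributions be scaled independently.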
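
The "Experiment Setup" row can also be collected into a small configuration sketch. The key names below are illustrative only and do not come from any released code; they simply gather the reported hyperparameters in one place.

```python
# Hypothetical configuration restating the reported setup; key names are
# illustrative, not taken from the authors' code (none is released).
experiment_setup = {
    "pretraining": {        # MAE-style ViT pretraining, following He et al. [2022]
        "epochs": 4000,
        "batch_size": 512,
    },
    "linear_probing": {     # encoder frozen; only the last layer is trained
        "epochs": 2000,
        "batch_size": 128,
    },
    "lambda_grid": [0, 0.01, 0.1, 1],  # values of λ considered
}

for key, value in experiment_setup.items():
    print(f"{key}: {value}")
```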