ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

Authors: Xiaojun Guo, Yifei Wang, Tianqi Du, Yisen Wang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various real-world datasets demonstrate the effectiveness of our proposed ContraNorm.
Researcher Affiliation | Academia | Xiaojun Guo (1), Yifei Wang (2), Tianqi Du (2), Yisen Wang (1,3); (1) National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (2) School of Mathematical Sciences, Peking University; (3) Institute for Artificial Intelligence, Peking University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. (A hedged sketch of a ContraNorm-style layer is given after the table.)
Open Source Code | Yes | Our implementation is available at https://github.com/PKU-ML/ContraNorm.
Open Datasets | Yes | GLUE datasets (Wang et al., 2019), ImageNet100 and ImageNet1k datasets (Russakovsky et al., 2015), Cora (McCallum et al., 2000) and Citeseer (Giles et al., 1998), Chameleon and Squirrel (Rozemberczki et al., 2021).
Dataset Splits | Yes | Specifically, ContraNorm boosts the average performance of BERT (Devlin et al., 2018) from 82.59% to 83.54% on the validation set of the General Language Understanding Evaluation (GLUE) datasets (Wang et al., 2019). We follow the data split setting in Kipf & Welling (2017) with train/validation/test splits of 60%, 20%, 20%, respectively. (A minimal split-construction sketch is given after the table.)
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA GeForce RTX 3090.
Software Dependencies | No | The paper mentions optimizers (Adam, AdamW) and repositories (timm, DeiT) but does not provide specific version numbers for key software components like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We use a batch size of 32 and fine-tune for 5 epochs over the data for all GLUE tasks. For each task, we select the best scale factor s in Eq.(6) among (0.005, 0.01, 0.05, 0.1, 0.2). We use base models (BERT-base and ALBERT-base) of 12 stacked blocks with hyperparameters fixed for all tasks: hidden size 128, 12 attention heads, maximum sequence length 384. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 2e-5. Specifically, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with cosine learning rate decay; we train each model for 150 epochs with a batch size of 1024. We choose the best scale controller s in the range {0.2, 0.5, 0.8, 1.0} for both PairNorm and ContraNorm. (A hedged configuration sketch is given after the table.)
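
Since the paper ships no pseudocode block, the following is a minimal PyTorch sketch of what a ContraNorm-style layer might look like. The update rule (subtracting a scale-factor-weighted, softmax-similarity aggregation of the features before layer normalization), the class name ContraNormSketch, and the L2 normalization of features are assumptions made for illustration; the exact Eq.(6) should be taken from the official repository at https://github.com/PKU-ML/ContraNorm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContraNormSketch(nn.Module):
    """Illustrative ContraNorm-style layer; NOT the official implementation."""

    def __init__(self, dim: int, scale: float = 0.1):
        super().__init__()
        self.scale = scale          # scale factor s swept in the experiment setup
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., num_tokens_or_nodes, dim)
        # Assumption: similarities are computed on L2-normalized features.
        x_hat = F.normalize(x, dim=-1)
        sim = torch.softmax(x_hat @ x_hat.transpose(-1, -2), dim=-1)
        # Assumed update: subtract the similarity-aggregated component scaled by s,
        # then apply LayerNorm.
        return self.norm(x - self.scale * (sim @ x))
```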
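The 60/20/20 train/validation/test split quoted in the Dataset Splits row can be illustrated with a short index-splitting helper. The function name, random seeding, and use of torch.randperm are illustrative choices, not details taken from the paper.

```python
import torch


def split_indices(num_nodes: int, seed: int = 0):
    """Randomly split node indices into 60% train / 20% val / 20% test."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=generator)
    n_train = int(0.6 * num_nodes)
    n_val = int(0.2 * num_nodes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```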
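The GLUE portion of the Experiment Setup row amounts to a small hyperparameter configuration plus a per-task grid search over the scale factor s. The sketch below collects the quoted values; the dataclass layout, the function name select_scale, and the hypothetical `evaluate` callable are assumptions, and only the numeric settings come from the quote.

```python
from dataclasses import dataclass


@dataclass
class GlueFinetuneConfig:
    """Hyperparameters as quoted in the Experiment Setup row (GLUE fine-tuning)."""
    batch_size: int = 32
    epochs: int = 5
    learning_rate: float = 2e-5      # Adam (Kingma & Ba, 2014)
    num_blocks: int = 12             # BERT-base / ALBERT-base stacked blocks
    hidden_size: int = 128           # as quoted above
    num_attention_heads: int = 12
    max_seq_length: int = 384


# Per-task grid for the scale factor s in Eq.(6), as quoted above.
SCALE_GRID = (0.005, 0.01, 0.05, 0.1, 0.2)


def select_scale(evaluate, config: GlueFinetuneConfig) -> float:
    """Return the scale factor with the best validation score.

    `evaluate` is a hypothetical callable (scale, config) -> validation metric;
    it stands in for a full fine-tuning run, which is out of scope here.
    """
    return max(SCALE_GRID, key=lambda s: evaluate(s, config))
```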