Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Authors: Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These experiments demonstrate the existence of an implicit bias of pre-training algorithms: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. The paper further states: 'We corroborate our theory with empirical evidence in Section 4. We show that for models with the same pre-training loss in the three situations above, the trace of Hessian of the pre-training loss strongly correlates with the downstream performance (see Figure 1).' A hedged sketch of estimating this Hessian trace is given after the table. |
| Researcher Affiliation | Academia | Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma, Department of Computer Science, Stanford University. Correspondence to: Hong Liu <hliu99@stanford.edu>. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is provided at https://github.com/Liuhong99/implicitbiasmlmcode. |
| Open Datasets | Yes | We use OpenWebText (https://huggingface.co/datasets/openwebtext) and BookCorpus (https://huggingface.co/datasets/bookcorpus) from Hugging Face. A hedged data-loading sketch is given after the table. |
| Dataset Splits | No | The paper mentions 'validation pre-training loss' and evaluates on 'validation datasets in the standard ways', but it does not provide explicit split percentages or sample counts for training, validation, and test sets. It does mention the total number of examples for some downstream tasks (e.g., 'Each of the downstream task contains 0.1M examples'). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions several software components (e.g., 'Adam W', 'PyTorch', 'bert-large-uncased'), but it does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use AdamW with a constant learning rate of 0.001, β1 = 0.9, and β2 = 0.98. We linearly increase the learning rate to warm up for 1000 steps on synthetic datasets and 5000 steps on real datasets. We include standard regularization: 0.1 dropout and 0.01 weight decay. We always use batch size 4096. A hedged optimizer and scheduler sketch reflecting these settings is given after the table. |
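
The correlation the paper reports is between downstream performance and the trace of the Hessian of the pre-training loss. For reference, below is a minimal PyTorch sketch of Hutchinson's stochastic trace estimator, tr(H) ≈ E[vᵀ H v] with Rademacher probe vectors v. The function name, sample count, and the choice of Hutchinson's estimator here are illustrative assumptions; the authors' actual measurement code is in their released repository.

```python
import torch

def hessian_trace_hutchinson(loss, params, num_samples=100):
    # Hutchinson's estimator: tr(H) = E[v^T H v] for Rademacher-distributed v.
    # `loss` must be a scalar computed from `params` with the autograd graph intact.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(num_samples):
        # Rademacher probes: entries are +1 or -1 with equal probability.
        vs = [torch.randint(0, 2, p.shape, device=p.device, dtype=p.dtype) * 2 - 1
              for p in params]
        # Hessian-vector product via double backprop: d(g . v)/dparams = H v.
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        estimate += sum((hv * v).sum().item() for hv, v in zip(hvs, vs))
    return estimate / num_samples
```

Typical usage would be `hessian_trace_hutchinson(loss, [p for p in model.parameters() if p.requires_grad])` on a held-out batch; averaging over several batches reduces the variance of the estimate.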
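The pre-training corpora are public Hugging Face datasets. A minimal loading sketch with the `datasets` library is below; the split choice and printed fields are illustrative, and depending on the `datasets` version these script-based datasets may require extra arguments.

```python
from datasets import load_dataset

# OpenWebText and BookCorpus, the two corpora cited in the paper.
openwebtext = load_dataset("openwebtext", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")

print(openwebtext)                  # dataset size and features
print(bookcorpus[0]["text"][:200])  # a short sample of raw text
```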
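The reported hyperparameters map onto a standard PyTorch AdamW setup with a linear warmup to a constant learning rate. The sketch below uses the 5000-step warmup reported for real datasets; the helper name and the generic `model` argument are assumptions, not the authors' training script. Dropout (0.1) is set in the model configuration and the 4096 batch size is typically reached via gradient accumulation, neither of which is shown here.

```python
import torch

def build_optimizer_and_scheduler(model, lr=1e-3, betas=(0.9, 0.98),
                                  weight_decay=0.01, warmup_steps=5000):
    # AdamW with the hyperparameters reported in the paper.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=betas,
                                  weight_decay=weight_decay)

    # Linearly ramp the learning rate from 0 to `lr` over `warmup_steps`,
    # then hold it constant.
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```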