Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Authors: Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These experiments demonstrate the existence of an implicit bias of pre-training algorithms: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. The paper further states: 'We corroborate our theory with empirical evidence in Section 4. We show that for models with the same pre-training loss in the three situations above, the trace of Hessian of the pre-training loss strongly correlates with the downstream performance (see Figure 1).' A hedged sketch of estimating this Hessian trace is given after the table. |
| Researcher Affiliation | Academia | Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma, Department of Computer Science, Stanford University. Correspondence to: Hong Liu <hliu99@stanford.edu>. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is provided at https://github.com/Liuhong99/implicitbiasmlmcode. |
| Open Datasets | Yes | We use OpenWebText (https://huggingface.co/datasets/openwebtext) and BookCorpus (https://huggingface.co/datasets/bookcorpus) from Hugging Face. A hedged data-loading sketch is given after the table. |
| Dataset Splits | No | The paper mentions 'validation pre-training loss' and evaluates on 'validation datasets in the standard ways', but it does not provide explicit split percentages or sample counts for training, validation, and test sets. It does mention the total number of examples for some downstream tasks (e.g., 'Each of the downstream task contains 0.1M examples'). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions several software components (e.g., 'Adam W', 'PyTorch', 'bert-large-uncased'), but it does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use AdamW with a constant learning rate of 0.001, β1 = 0.9, and β2 = 0.98. We linearly increase the learning rate to warm up for 1000 steps on synthetic datasets and 5000 steps on real datasets. We include standard regularization: 0.1 dropout and 0.01 weight decay. We always use batch size 4096. A hedged optimizer and scheduler sketch reflecting these settings is given after the table. |
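
The correlation the paper reports is between downstream performance and the trace of the Hessian of the pre-training loss. For reference, below is a minimal PyTorch sketch of Hutchinson's stochastic trace estimator, tr(H) ≈ E[vᵀ H v] with Rademacher probe vectors v. The function name, sample count, and the choice of Hutchinson's estimator here are illustrative assumptions; the authors' actual measurement code is in their released repository.

```python
import torch

def hessian_trace_hutchinson(loss, params, num_samples=100):
    # Hutchinson's estimator: tr(H) = E[v^T H v] for Rademacher-distributed v.
    # `loss` must be a scalar computed from `params` with the autograd graph intact.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(num_samples):
        # Rademacher probes: entries are +1 or -1 with equal probability.
        vs = [torch.randint(0, 2, p.shape, device=p.device, dtype=p.dtype) * 2 - 1
              for p in params]
        # Hessian-vector product via double backprop: d(g . v)/dparams = H v.
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        estimate += sum((hv * v).sum().item() for hv, v in zip(hvs, vs))
    return estimate / num_samples
```

Typical usage would be `hessian_trace_hutchinson(loss, [p for p in model.parameters() if p.requires_grad])` on a held-out batch; averaging over several batches reduces the variance of the estimate.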
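The pre-training corpora are public Hugging Face datasets. A minimal loading sketch with the `datasets` library is below; the split choice and printed fields are illustrative, and depending on the `datasets` version these script-based datasets may require extra arguments.

```python
from datasets import load_dataset

# OpenWebText and BookCorpus, the two corpora cited in the paper.
openwebtext = load_dataset("openwebtext", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")

print(openwebtext)                  # dataset size and features
print(bookcorpus[0]["text"][:200])  # a short sample of raw text
```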
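The reported hyperparameters map onto a standard PyTorch AdamW setup with a linear warmup to a constant learning rate. The sketch below uses the 5000-step warmup reported for real datasets; the helper name and the generic `model` argument are assumptions, not the authors' training script. Dropout (0.1) is set in the model configuration and the 4096 batch size is typically reached via gradient accumulation, neither of which is shown here.

```python
import torch

def build_optimizer_and_scheduler(model, lr=1e-3, betas=(0.9, 0.98),
                                  weight_decay=0.01, warmup_steps=5000):
    # AdamW with the hyperparameters reported in the paper.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=betas,
                                  weight_decay=weight_decay)

    # Linearly ramp the learning rate from 0 to `lr` over `warmup_steps`,
    # then hold it constant.
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```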