The Lottery Ticket Hypothesis for Pre-trained BERT Networks

Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all.
Researcher Affiliation | Collaboration | Tianlong Chen¹, Jonathan Frankle², Shiyu Chang³, Sijia Liu³, Yang Zhang³, Zhangyang Wang¹, Michael Carbin²; ¹University of Texas at Austin, ²MIT CSAIL, ³MIT-IBM Watson AI Lab, IBM Research
Pseudocode | Yes | Algorithm 1: Iterative Magnitude Pruning (IMP) to sparsity s with rewinding step i.
1: Train the pre-trained network f(x; θ_0, γ_0) to step i: f(x; θ_i, γ_i) = A_t^i(f(x; θ_0, γ_0)).
2: Set the initial pruning mask to m = 1^d (all ones).
3: repeat
4:   Train f(x; m ⊙ θ_i, γ_i) to step t: f(x; m ⊙ θ_t, γ_t) = A_t^{t−i}(f(x; m ⊙ θ_i, γ_i)).
5:   Prune 10% of the remaining weights [28] of m ⊙ θ_t and update m accordingly.
6: until the sparsity of m reaches s
7: Return f(x; m ⊙ θ_i).
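The pseudocode above maps onto a short train-prune-rewind loop. Below is a minimal PyTorch sketch of IMP with rewinding, assuming a user-supplied `train_fn` that fine-tunes the masked model for t − i steps; the helper names (`apply_mask`, `imp_with_rewinding`), the choice to prune only weight matrices, and the omission of gradient masking are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of IMP with rewinding (Algorithm 1). `train_fn(model)` is an
# assumed user-supplied routine that fine-tunes the masked model in place.
import copy
import torch


def apply_mask(model, mask):
    """Zero out pruned weights in place according to the binary mask m."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in mask:
                param.mul_(mask[name])


def imp_with_rewinding(model, train_fn, target_sparsity=0.7, prune_rate=0.1):
    """Prune 10% of remaining weights per round, then rewind the survivors
    to their values at the rewinding step (theta_i)."""
    # Steps 1-2: weights at the rewinding step and an all-ones mask.
    rewind_state = copy.deepcopy(model.state_dict())
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()
            if p.dim() > 1}  # prune weight matrices only (assumption)

    def sparsity():
        total = sum(m.numel() for m in mask.values())
        return 1.0 - sum(m.sum().item() for m in mask.values()) / total

    while sparsity() < target_sparsity:
        # Step 4: train f(x; m ⊙ theta_i) to step t.
        # (Gradient masking of pruned weights is omitted for brevity.)
        apply_mask(model, mask)
        train_fn(model)

        # Step 5: globally prune the lowest-magnitude 10% of remaining weights.
        scores = torch.cat([(p.detach().abs() * mask[n]).flatten()
                            for n, p in model.named_parameters() if n in mask])
        remaining = scores[scores > 0]
        k = max(1, int(prune_rate * remaining.numel()))
        threshold = remaining.kthvalue(k).values
        for n, p in model.named_parameters():
            if n in mask:
                mask[n] = (p.detach().abs() > threshold).float() * mask[n]

        # Rewind the surviving weights back to theta_i before the next round.
        model.load_state_dict(rewind_state)

    # Step 7: return the rewound weights under the final mask.
    apply_mask(model, mask)
    return model, mask
```

In the paper's setting, `model` would be a pre-trained BERT, and the returned masked, rewound weights form the candidate subnetwork evaluated on each downstream task.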
Open Source Code | Yes | Code is available at https://github.com/VITA-Group/BERT-Tickets.
Open Datasets | Yes | Downstream tasks include nine tasks from the GLUE benchmark [50] and an additional question-answering dataset, SQuAD v1.1 [51].
Dataset Splits | Yes | All reported experimental results are computed on the validation/dev splits.
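For reference, a minimal sketch of pulling the corresponding evaluation splits with the HuggingFace `datasets` library; the paper does not specify its data-loading code, so the library choice and the MRPC example task are assumptions.

```python
# Sketch of loading the evaluation splits used for reporting (assumed loader).
from datasets import load_dataset

# One of the nine GLUE tasks; MRPC is shown here purely as an example.
mrpc = load_dataset("glue", "mrpc")
dev_mrpc = mrpc["validation"]        # results are reported on the dev split

# SQuAD v1.1: evaluation is likewise on the validation split.
squad = load_dataset("squad")
dev_squad = squad["validation"]

print(len(dev_mrpc), len(dev_squad))
```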
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or cloud instances with specifications) are mentioned; the paper only acknowledges 'computing resources'.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are explicitly mentioned in the paper, beyond general tools such as 'AdamW' and an implicit dependence on 'HuggingFace Transformers' without a version.
Experiment Setup | Yes | Table 1: Details of pre-training and fine-tuning... # Epochs, Batch Size, Learning Rate... Optimizer: AdamW [52] with ε = 1e-8. Learning rate decays linearly from its initial value to zero.
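A brief sketch of the quoted optimizer configuration using `torch.optim.AdamW` and the HuggingFace linear-decay schedule; the model name, learning rate, and step counts below are placeholders, not the paper's exact per-task values.

```python
# AdamW with eps = 1e-8 and a learning rate that decays linearly to zero.
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)   # lr is a placeholder
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=10_000  # placeholder steps
)

# In the training loop, optimizer.step() followed by scheduler.step()
# walks the learning rate linearly from its initial value down to zero.
```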