The Lottery Ticket Hypothesis for Pre-trained BERT Networks
Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. |
| Researcher Affiliation | Collaboration | Tianlong Chen¹, Jonathan Frankle², Shiyu Chang³, Sijia Liu³, Yang Zhang³, Zhangyang Wang¹, Michael Carbin²; ¹University of Texas at Austin, ²MIT CSAIL, ³MIT-IBM Watson AI Lab, IBM Research |
| Pseudocode | Yes | Algorithm 1: Iterative Magnitude Pruning (IMP) to sparsity s with rewinding step i. 1: Train the pre-trained network f(x; θ_0, γ_0) to step i: f(x; θ_i, γ_i) = A^T_i(f(x; θ_0, γ_0)). 2: Set the initial pruning mask to m = 1^{d_1}. 3: repeat 4: Train f(x; m ⊙ θ_i, γ_i) to step t: f(x; m ⊙ θ_t, γ_t) = A^T_{t-i}(f(x; m ⊙ θ_i, γ_i)). 5: Prune 10% of the remaining weights [28] of m ⊙ θ_t and update m accordingly. 6: until the sparsity of m reaches s. 7: Return f(x; m ⊙ θ_i). (A hedged Python sketch of this IMP loop appears below the table.) |
| Open Source Code | Yes | Codes available at https://github.com/VITA-Group/BERT-Tickets. |
| Open Datasets | Yes | Downstream tasks include nine tasks from the GLUE benchmark [50] and another question-answering dataset, SQuAD v1.1 [51]. (A hedged dataset-loading sketch appears below the table.) |
| Dataset Splits | Yes | All experiment results we presented are calculated from the validation/dev datasets. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific cloud instances with specs) are mentioned. The paper only acknowledges 'computing resources'. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are explicitly mentioned in the paper, beyond general tools such as AdamW or, implicitly, HuggingFace Transformers without a version. |
| Experiment Setup | Yes | Table 1: Details of pre-training and fine-tuning... # Epochs, Batch Size, Learning Rate... Optimizer: AdamW [52] with ϵ = 1 × 10⁻⁸. Learning rate decays linearly from its initial value to zero. (A hedged fine-tuning sketch appears below the table.) |
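The IMP procedure quoted in the Pseudocode row can be summarized in code. The sketch below is a minimal PyTorch rendering, not the authors' released implementation: the `train_fn` callback, the restriction of pruning to weight matrices, and the default `target_sparsity`/`prune_rate` values are assumptions made for illustration.

```python
import copy
import torch

def imp_with_rewinding(model, train_fn, target_sparsity=0.7, prune_rate=0.1):
    """Sketch of iterative magnitude pruning (IMP) with rewinding.

    `model` should already be trained to the rewinding step i (theta_i in
    Algorithm 1); `train_fn(model)` fine-tunes the masked model in place.
    """
    # Snapshot the rewinding point theta_i; every round restarts from it.
    rewind_state = copy.deepcopy(model.state_dict())

    # Initial mask m = 1 over the prunable weights (weight matrices only,
    # an assumption; biases and LayerNorm parameters are left untouched).
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    def sparsity():
        total = sum(m.numel() for m in masks.values())
        return sum((m == 0).sum().item() for m in masks.values()) / total

    while sparsity() < target_sparsity:
        # Rewind to theta_i and apply the current mask (m ⊙ theta_i).
        model.load_state_dict(rewind_state)
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    p.mul_(masks[n])

        # Fine-tune the masked subnetwork to step t. In a faithful
        # implementation the mask is re-applied after every optimizer step
        # so pruned weights stay at zero.
        train_fn(model)

        # Globally prune the lowest-magnitude 10% of the remaining weights.
        with torch.no_grad():
            surviving = torch.cat([p[masks[n] == 1].abs().flatten()
                                   for n, p in model.named_parameters() if n in masks])
            k = max(1, int(prune_rate * surviving.numel()))
            threshold = surviving.kthvalue(k).values
            for n, p in model.named_parameters():
                if n in masks:
                    masks[n][(p.abs() <= threshold) & (masks[n] == 1)] = 0

    # Return the rewound weights with the final mask applied: f(x; m ⊙ theta_i).
    model.load_state_dict(rewind_state)
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.mul_(masks[n])
    return model, masks
```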
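The paper does not state how the GLUE and SQuAD data were obtained; as one plausible route, the HuggingFace `datasets` library exposes both benchmarks, including the validation/dev splits on which all reported results are computed.

```python
from datasets import load_dataset

# GLUE tasks (MNLI here as an example; the benchmark covers nine tasks) and SQuAD v1.1.
mnli = load_dataset("glue", "mnli")
squad = load_dataset("squad")

# The paper reports all results on the validation/dev splits.
print(mnli["validation_matched"])  # MNLI exposes matched/mismatched dev sets
print(squad["validation"])
```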
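The fine-tuning recipe summarized in Table 1 (AdamW with ϵ = 1 × 10⁻⁸ and a learning rate decaying linearly to zero) can be sketched as follows. The default learning rate, epoch count, the absence of warmup, and the HuggingFace-style `model`/`dataloader` interface are illustrative assumptions rather than values taken from the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def finetune(model, dataloader, lr=2e-5, num_epochs=3):
    """Fine-tune with AdamW (eps = 1e-8) and a linear LR decay to zero."""
    num_training_steps = num_epochs * len(dataloader)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps=1e-8)
    # Linear decay from the initial learning rate to zero; no warmup assumed.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            loss = model(**batch).loss  # HuggingFace-style model output
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    return model
```

A routine like this could also serve as the `train_fn` argument in the IMP sketch above.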