The Lottery Ticket Hypothesis for Pre-trained BERT Networks

Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all.
Researcher Affiliation | Collaboration | Tianlong Chen¹, Jonathan Frankle², Shiyu Chang³, Sijia Liu³, Yang Zhang³, Zhangyang Wang¹, Michael Carbin²; ¹University of Texas at Austin, ²MIT CSAIL, ³MIT-IBM Watson AI Lab, IBM Research
Pseudocode | Yes | Algorithm 1: Iterative Magnitude Pruning (IMP) to sparsity s with rewinding step i.
1: Train the pre-trained network f(x; θ_0, γ_0) to step i: f(x; θ_i, γ_i) = A_t^i(f(x; θ_0, γ_0)).
2: Set the initial pruning mask to m = 1^d (all ones).
3: repeat
4:   Train f(x; m ⊙ θ_i, γ_i) to step t: f(x; m ⊙ θ_t, γ_t) = A_t^{t−i}(f(x; m ⊙ θ_i, γ_i)).
5:   Prune 10% of the remaining weights [28] of m ⊙ θ_t and update m accordingly.
6: until the sparsity of m reaches s
7: Return f(x; m ⊙ θ_i).
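The pseudocode above maps onto a short train-prune-rewind loop. Below is a minimal PyTorch sketch of IMP with rewinding, assuming a user-supplied `train_fn` that fine-tunes the masked model for t − i steps; the helper names (`apply_mask`, `imp_with_rewinding`), the choice to prune only weight matrices, and the omission of gradient masking are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of IMP with rewinding (Algorithm 1). `train_fn(model)` is an
# assumed user-supplied routine that fine-tunes the masked model in place.
import copy
import torch


def apply_mask(model, mask):
    """Zero out pruned weights in place according to the binary mask m."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in mask:
                param.mul_(mask[name])


def imp_with_rewinding(model, train_fn, target_sparsity=0.7, prune_rate=0.1):
    """Prune 10% of remaining weights per round, then rewind the survivors
    to their values at the rewinding step (theta_i)."""
    # Steps 1-2: weights at the rewinding step and an all-ones mask.
    rewind_state = copy.deepcopy(model.state_dict())
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()
            if p.dim() > 1}  # prune weight matrices only (assumption)

    def sparsity():
        total = sum(m.numel() for m in mask.values())
        return 1.0 - sum(m.sum().item() for m in mask.values()) / total

    while sparsity() < target_sparsity:
        # Step 4: train f(x; m ⊙ theta_i) to step t.
        # (Gradient masking of pruned weights is omitted for brevity.)
        apply_mask(model, mask)
        train_fn(model)

        # Step 5: globally prune the lowest-magnitude 10% of remaining weights.
        scores = torch.cat([(p.detach().abs() * mask[n]).flatten()
                            for n, p in model.named_parameters() if n in mask])
        remaining = scores[scores > 0]
        k = max(1, int(prune_rate * remaining.numel()))
        threshold = remaining.kthvalue(k).values
        for n, p in model.named_parameters():
            if n in mask:
                mask[n] = (p.detach().abs() > threshold).float() * mask[n]

        # Rewind the surviving weights back to theta_i before the next round.
        model.load_state_dict(rewind_state)

    # Step 7: return the rewound weights under the final mask.
    apply_mask(model, mask)
    return model, mask
```

In the paper's setting, `model` would be a pre-trained BERT, and the returned masked, rewound weights form the candidate subnetwork evaluated on each downstream task.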
Open Source Code | Yes | Code is available at https://github.com/VITA-Group/BERT-Tickets.
Open Datasets | Yes | Downstream tasks include nine tasks from the GLUE benchmark [50] and an additional question-answering dataset, SQuAD v1.1 [51].
Dataset Splits | Yes | All reported experimental results are computed on the validation/dev splits.
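For reference, a minimal sketch of pulling the corresponding evaluation splits with the HuggingFace `datasets` library; the paper does not specify its data-loading code, so the library choice and the MRPC example task are assumptions.

```python
# Sketch of loading the evaluation splits used for reporting (assumed loader).
from datasets import load_dataset

# One of the nine GLUE tasks; MRPC is shown here purely as an example.
mrpc = load_dataset("glue", "mrpc")
dev_mrpc = mrpc["validation"]        # results are reported on the dev split

# SQuAD v1.1: evaluation is likewise on the validation split.
squad = load_dataset("squad")
dev_squad = squad["validation"]

print(len(dev_mrpc), len(dev_squad))
```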
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or cloud instances with specifications) are mentioned; the paper only acknowledges 'computing resources'.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are explicitly mentioned in the paper, beyond general tools such as 'AdamW' and an implicit dependence on 'HuggingFace Transformers' without a version.
Experiment Setup | Yes | Table 1: Details of pre-training and fine-tuning... # Epochs, Batch Size, Learning Rate... Optimizer: AdamW [52] with ε = 1e-8. Learning rate decays linearly from its initial value to zero.
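A brief sketch of the quoted optimizer configuration using `torch.optim.AdamW` and the HuggingFace linear-decay schedule; the model name, learning rate, and step counts below are placeholders, not the paper's exact per-task values.

```python
# AdamW with eps = 1e-8 and a learning rate that decays linearly to zero.
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)   # lr is a placeholder
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=10_000  # placeholder steps
)

# In the training loop, optimizer.step() followed by scheduler.step()
# walks the learning rate linearly from its initial value down to zero.
```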