Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
The Lottery Ticket Hypothesis for Pre-trained BERT Networks
Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, we indeed find matching subnetworks at 40% to 90% sparsity. We find these subnetworks at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. |
| Researcher Affiliation | Collaboration | Tianlong Chen¹, Jonathan Frankle², Shiyu Chang³, Sijia Liu³, Yang Zhang³, Zhangyang Wang¹, Michael Carbin² ¹University of Texas at Austin, ²MIT CSAIL, ³MIT-IBM Watson AI Lab, IBM Research |
| Pseudocode | Yes | Algorithm 1 Iterative Magnitude Pruning (IMP) to sparsity $s$ with rewinding step $i$. 1: Train the pre-trained network $f(x; \theta_0, \gamma_0)$ to step $i$: $f(x; \theta_i, \gamma_i) = \mathcal{A}^{\mathcal{T}}_{i}(f(x; \theta_0, \gamma_0))$. 2: Set the initial pruning mask to $m = 1^{d_1}$. 3: repeat 4: Train $f(x; m \odot \theta_i, \gamma_i)$ to step $t$: $f(x; m \odot \theta_t, \gamma_t) = \mathcal{A}^{\mathcal{T}}_{t-i}(f(x; m \odot \theta_i, \gamma_i))$. 5: Prune 10% of remaining weights [28] of $m \odot \theta_t$ and update $m$ accordingly. 6: until the sparsity of $m$ reaches $s$ 7: Return $f(x; m \odot \theta_i)$. |
| Open Source Code | Yes | Codes available at https://github.com/VITA-Group/BERT-Tickets. |
| Open Datasets | Yes | Downstream tasks include nine tasks from GLUE benchmark [50] and another question-answering dataset, SQuAD v1.1 [51]. |
| Dataset Splits | Yes | All experiment results we presented are calculated from the validation/dev datasets. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific cloud instances with specs) are mentioned. The paper only acknowledges 'computing resources'. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are explicitly mentioned in the paper, beyond general tools like 'AdamW' or, implicitly, 'HuggingFace Transformers' without a version. |
| Experiment Setup | Yes | Table 1: Details of pre-training and fine-tuning... # Epochs, Batch Size, Learning Rate... Optimizer AdamW [52] with $\epsilon = 1 \times 10^{-8}$. Learning rate decays linearly from initial value to zero. |
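
The Pseudocode row above quotes Algorithm 1 (iterative magnitude pruning with rewinding). As a rough illustration of its control flow, the following is a minimal NumPy sketch, not the authors' implementation (which is linked in the Open Source Code row). It rests on several assumptions: the task-specific parameters γ are omitted, the training algorithm is replaced by a hypothetical toy gradient-descent routine on a quadratic loss, and all function names (`imp`, `prune_smallest`, `make_toy_train`, `sparsity`) are invented for this example.

```python
import numpy as np


def sparsity(mask):
    """Fraction of weights that the mask has removed."""
    return 1.0 - mask.sum() / mask.size


def prune_smallest(theta, mask, fraction=0.1):
    """Remove the `fraction` of surviving weights with the smallest magnitude."""
    flat_mask = mask.ravel().copy()
    alive = np.flatnonzero(flat_mask)
    n_prune = int(np.ceil(fraction * alive.size))
    smallest = alive[np.argsort(np.abs(theta.ravel()[alive]))[:n_prune]]
    flat_mask[smallest] = 0.0
    return flat_mask.reshape(mask.shape)


def imp(theta0, target_sparsity, rewind_step, total_steps, train_fn):
    """Iterative magnitude pruning with rewinding to step `rewind_step`."""
    # Step 1: train the dense network to the rewinding step i.
    theta_i = train_fn(theta0, np.ones_like(theta0), rewind_step)
    # Step 2: start from an all-ones mask.
    mask = np.ones_like(theta0)
    while True:  # Steps 3-6: repeat ... until
        # Step 4: train the masked network from theta_i for the remaining t - i steps.
        theta_t = train_fn(theta_i, mask, total_steps - rewind_step)
        # Step 5: prune 10% of the remaining weights by magnitude.
        mask = prune_smallest(theta_t, mask, fraction=0.1)
        # Step 6: stop once the mask is sparse enough.
        if sparsity(mask) >= target_sparsity:
            break
    # Step 7: return the mask and the rewound, masked weights.
    return mask, mask * theta_i


def make_toy_train(target, lr=0.1):
    """Hypothetical stand-in for training: descent on 0.5 * ||m * theta - target||^2."""
    def train_fn(theta, mask, steps):
        theta = theta.copy()
        for _ in range(steps):
            # Pruned coordinates receive no update because the gradient is masked.
            theta -= lr * mask * (mask * theta - target)
        return theta
    return train_fn


# Toy usage: 100 weights, rewinding step i = 0 (the starting initialization).
rng = np.random.default_rng(0)
theta0 = rng.normal(size=100)
train_fn = make_toy_train(target=rng.normal(size=100))
mask, ticket = imp(theta0, target_sparsity=0.5, rewind_step=0,
                   total_steps=20, train_fn=train_fn)
print(f"sparsity: {sparsity(mask):.2f}")
```

With `rewind_step=0` the returned weights are simply the masked initial weights, which loosely mirrors the setting described in the Research Type row, where subnetworks are found at the pre-trained initialization. A real experiment would replace the toy training routine with task fine-tuning and would track γ alongside θ.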