Efficient Training of Language Models using Few-Shot Learning
Authors: Sashank J. Reddi, Sobhan Miryoosefi, Stefani Karp, Shankar Krishnan, Satyen Kale, Seungyeon Kim, Sanjiv Kumar
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments. We conduct comprehensive experiments demonstrating several variations of our approach and show that they outperform the baselines. |
| Researcher Affiliation | Collaboration | 1Google Research, NY, USA; 2Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Sashank J. Reddi <sashank@google.com>, Sobhan Miryoosefi <miryoosefi@google.com>. |
| Pseudocode | Yes | Algorithm 1 Few-Shot Stacking. Algorithm 2 Independent LM & Few-shot learner. (An illustrative sketch of the stacking idea appears after the table.) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its source code or provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | BERT is trained on the Books Corpus (800M words) and Wikipedia (2,500M words). We use the same dataset for the experiments. |
| Dataset Splits | No | The paper mentions using the Books Corpus and Wikipedia for training, but it does not explicitly provide details about specific training/validation/test dataset splits, percentages, or sample counts used for its experiments. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware components (e.g., GPU models, CPU types, or cloud computing instances) used for running its experiments. |
| Software Dependencies | No | The paper mentions using optimizers like AdamW and specific BERT models, but it does not provide version numbers for any software dependencies such as deep learning frameworks (e.g., TensorFlow, PyTorch) or programming languages (e.g., Python). |
| Experiment Setup | Yes | Unless explicitly stated otherwise, all BERT-BASE and BERT-LARGE experiments used the following hyperparameter settings. Each stage began with 10,000 linear warmup steps (from a learning rate of 0 to a learning rate of 0.0001). After warmup, the learning rate was held constant throughout the stage, for all stages other than the final stage. In the final stage, after warmup, the learning rate was linearly decayed to 0. AdamW was used as the optimizer, with β1 = 0.9, β2 = 0.999, ϵ = 10^-7, and 0 weight decay. |
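
The learning-rate schedule quoted in the Experiment Setup row is concrete enough to sketch in code. Below is a minimal, framework-free Python sketch of that per-stage schedule (10,000 linear warmup steps to a peak of 1e-4, held constant after warmup in every stage except the last, and decayed linearly to 0 after warmup in the final stage). The `stage_steps` argument is a hypothetical parameter, since per-stage step counts are not quoted above; the AdamW settings from the table appear only as a comment.

```python
def stage_learning_rate(step, stage_steps, peak_lr=1e-4, warmup_steps=10_000):
    """Learning rate at global `step`, following the quoted schedule:
    each stage starts with `warmup_steps` of linear warmup from 0 to
    `peak_lr`; after warmup the rate is held constant, except in the
    final stage, where it decays linearly to 0.

    `stage_steps` is a hypothetical list of per-stage step counts (the
    paper's actual stage lengths are not quoted in the table).
    Quoted optimizer settings: AdamW with beta1=0.9, beta2=0.999,
    eps=1e-7, and 0 weight decay.
    """
    offset = step
    for i, n in enumerate(stage_steps):
        is_final = i == len(stage_steps) - 1
        if offset < n or is_final:
            if offset < warmup_steps:                  # linear warmup from 0
                return peak_lr * offset / warmup_steps
            if not is_final:                           # constant plateau
                return peak_lr
            decay_steps = max(1, n - warmup_steps)     # linear decay to 0
            return peak_lr * max(0.0, 1.0 - (offset - warmup_steps) / decay_steps)
        offset -= n
    return 0.0


# Example with made-up stage lengths (not taken from the paper):
# [stage_learning_rate(s, [50_000, 50_000, 100_000]) for s in (0, 5_000, 60_000, 199_999)]
```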
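
The Pseudocode row lists Algorithm 1 (Few-Shot Stacking) without reproducing it. As a purely illustrative aid, the following hedged Python sketch shows the generic depth-wise stacking idea behind staged training, i.e. warm-starting a deeper network by duplicating the layers of a shallower trained one. The function name, the `growth_factor` parameter, and the copy-all-layers policy are assumptions for illustration, not the paper's algorithm.

```python
import copy


def stack_layers(trained_layers, growth_factor=2):
    """Hypothetical depth-wise stacking initialization.

    `trained_layers` is a list of layer-parameter objects from a trained
    shallow model; the returned list has `growth_factor` times as many
    layers, built by repeating deep copies of the trained stack. This
    only illustrates the general stacking idea behind staged training;
    the paper's Algorithm 1 (Few-Shot Stacking) may copy or insert
    layers differently.
    """
    deeper = []
    for _ in range(growth_factor):
        deeper.extend(copy.deepcopy(layer) for layer in trained_layers)
    return deeper


# Example with placeholder "layers" (dicts standing in for real parameters):
shallow = [{"name": f"layer_{i}"} for i in range(6)]
deep = stack_layers(shallow)  # 12 layers initialized from the 6 trained ones
```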