Efficient Training of Language Models using Few-Shot Learning

Authors: Sashank J. Reddi, Sobhan Miryoosefi, Stefani Karp, Shankar Krishnan, Satyen Kale, Seungyeon Kim, Sanjiv Kumar

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments. We conduct comprehensive experiments demonstrating several variations of our approach and show that they outperform the baselines.
Researcher Affiliation | Collaboration | 1 Google Research, NY, USA; 2 Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Sashank J. Reddi <sashank@google.com>, Sobhan Miryoosefi <miryoosefi@google.com>.
Pseudocode | Yes | Algorithm 1 Few-Shot Stacking. Algorithm 2 Independent LM & Few-shot learner.
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | BERT is trained on the Books Corpus (800M words) and Wikipedia (2,500M words). We use the same dataset for the experiments.
Dataset Splits | No | The paper mentions using the Books Corpus and Wikipedia for training, but it does not explicitly provide details about specific training/validation/test splits, percentages, or sample counts used for its experiments.
Hardware Specification | No | The paper does not explicitly mention specific hardware components (e.g., GPU models, CPU types, or cloud computing instances) used for running its experiments.
Software Dependencies | No | The paper mentions using optimizers like AdamW and specific BERT models, but it does not provide version numbers for any software dependencies such as deep learning frameworks (e.g., TensorFlow, PyTorch) or programming languages (e.g., Python).
Experiment Setup | Yes | Unless explicitly stated otherwise, all BERT-BASE and BERT-LARGE experiments used the following hyperparameter settings. Each stage began with 10,000 linear warmup steps (from a learning rate of 0 to a learning rate of 0.0001). After warmup, the learning rate was held constant throughout the stage, for all stages other than the final stage. In the final stage, after warmup, the learning rate was linearly decayed to 0. AdamW was used as the optimizer, with β1 = 0.9, β2 = 0.999, ϵ = 10^-7, and 0 weight decay.
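The settings quoted in the Experiment Setup row translate directly into code. The sketch below is illustrative only: the paper does not name a framework (see the Software Dependencies row), so PyTorch is assumed, and the model, stage length (100,000 steps), and single optimization step are hypothetical placeholders. Only the 10,000-step linear warmup to 0.0001, the constant-then-linear-decay shape, and the AdamW hyperparameters (β1 = 0.9, β2 = 0.999, ϵ = 10^-7, zero weight decay) come from the paper's stated setup.

```python
import torch

PEAK_LR = 1e-4         # learning rate reached at the end of warmup (0.0001)
WARMUP_STEPS = 10_000  # linear warmup steps at the start of every stage


def stage_lr_lambda(total_stage_steps, is_final_stage):
    """LambdaLR multiplier: linear warmup from 0 to 1, then constant
    (non-final stages) or linear decay to 0 (final stage)."""
    def lr_lambda(step):
        if step < WARMUP_STEPS:
            return step / WARMUP_STEPS
        if not is_final_stage:
            return 1.0
        decay_steps = total_stage_steps - WARMUP_STEPS
        return max(0.0, (total_stage_steps - step) / decay_steps)
    return lr_lambda


# Hypothetical stand-in for the BERT encoder; the real model is not shown here.
model = torch.nn.Linear(768, 768)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.999),  # beta1, beta2 as reported
    eps=1e-7,            # epsilon as reported
    weight_decay=0.0,    # zero weight decay as reported
)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, stage_lr_lambda(total_stage_steps=100_000, is_final_stage=True)
)

# One illustrative optimization step; a real run would loop over batches and
# call optimizer.step() followed by scheduler.step() at every training step.
loss = model(torch.randn(8, 768)).pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()
```

In a multi-stage (stacking) run, each stage would construct a fresh scheduler of this shape, passing is_final_stage=True only for the last stage, so that only the final stage decays the learning rate to 0 after its warmup.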