Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data

Authors: Alon Albalak, Colin A. Raffel, William Yang Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3.
Researcher Affiliation | Academia | Alon Albalak, University of California, Santa Barbara (alon_albalak@ucsb.edu); Colin Raffel, University of Toronto & Vector Institute (craffel@gmail.com); William Yang Wang, University of California, Santa Barbara (william@cs.ucsb.edu)
Pseudocode | Yes | We include here pseudo-code for our 2 proposed algorithms. Algorithm 1 contains the pseudo-code for EXP3-FLAD, and Algorithm 2 contains the pseudo-code for UCB1-FLAD. (Illustrative sketches of both algorithms follow the table.)
Open Source Code | Yes | All of our code is available at github.com/alon-albalak/FLAD.
Open Datasets | Yes | We obtain all datasets from Hugging Face Datasets, and cast them to the text-to-text format by applying prompt templates from the Public Pool of Prompts (P3) [23] that was used to train T0.
Dataset Splits | Yes | For each dataset, we randomly sample five few-shot splits from their training data, containing the same number of training examples as previous works, between 20 to 70 [55, 56]. We further divide each split into equal training and validation partitions for true few-shot learning [57] (e.g. 10 train and 10 validation samples for HellaSwag).
Hardware Specification | Yes | We train all models (FLAD and non-FLAD) on 40GB A100s.
Software Dependencies | No | We used model checkpoints from Hugging Face Transformers [45]. For all experiments we use the Adafactor optimizer [58].
Experiment Setup | Yes | For the target-only baseline, we use learning rates in {1e-4, 3e-4}. For all other methods, we always use a learning rate of 1e-4. For target-, explore-, and exploit-only baselines we use batch sizes in {32, 128}. For loss-scaling, EXP3-FLAD, and UCB1-FLAD we use mini-batches of 8 samples and let G be in {4, 16} to match the batch size of all methods. For explore- and exploit-only, we use a target dataset mixing ratio of M ∈ {1, 5, 10}. For all experiments we use the Adafactor optimizer [58] and validation-based early stopping for model checkpoint selection.
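
The report notes that the paper includes pseudo-code for EXP3-FLAD and UCB1-FLAD but does not reproduce it. Below is a minimal, runnable sketch of an EXP3-style sampler over auxiliary datasets in the spirit of EXP3-FLAD; it is not the paper's algorithm. The class name, the `gamma` exploration rate, and the handling of the reward are assumptions for illustration, and the reward itself (which in EXP3-FLAD is derived from alignment with the target task) is abstracted into an argument supplied by the caller.

```python
import math
import random


class EXP3DatasetSampler:
    """Illustrative EXP3-style sampler over auxiliary datasets.

    Each step: draw a dataset from a mixture of the weight-proportional
    distribution and a uniform exploration term, take a training step on a
    mini-batch from that dataset, then feed the observed reward back through
    an importance-weighted exponential update.
    """

    def __init__(self, dataset_names, gamma=0.1):
        self.names = list(dataset_names)
        self.gamma = gamma                                # exploration rate (assumed value)
        self.weights = {name: 1.0 for name in self.names}

    def probabilities(self):
        total = sum(self.weights.values())
        k = len(self.names)
        return {
            name: (1.0 - self.gamma) * (w / total) + self.gamma / k
            for name, w in self.weights.items()
        }

    def sample(self):
        probs = self.probabilities()
        name = random.choices(self.names, weights=[probs[n] for n in self.names])[0]
        return name, probs[name]

    def update(self, name, prob, reward):
        # Standard EXP3 update: importance-weight the reward by the sampling
        # probability, then fold it into the chosen dataset's weight.
        estimated_reward = reward / prob
        self.weights[name] *= math.exp(self.gamma * estimated_reward / len(self.names))
```

In a FLAD-style training loop, `sample()` would be called once per auxiliary mini-batch, and the reward passed to `update()` would come from a signal such as the alignment between the auxiliary-batch gradient and the target-task gradient.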
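
For the second algorithm, here is a similarly hedged sketch of a UCB1-style selector. It implements the textbook UCB1 index (empirical mean reward plus an exploration bonus), whereas UCB1-FLAD as described in the paper adds its own reward definition and initialization; all names below are illustrative.

```python
import math


class UCB1DatasetSelector:
    """Illustrative UCB1-style selector over auxiliary datasets."""

    def __init__(self, dataset_names):
        self.counts = {name: 0 for name in dataset_names}        # times each dataset was selected
        self.mean_rewards = {name: 0.0 for name in dataset_names}
        self.total_steps = 0

    def select(self):
        # Select every dataset once before applying the UCB1 index.
        for name, count in self.counts.items():
            if count == 0:
                return name
        return max(
            self.counts,
            key=lambda name: self.mean_rewards[name]
            + math.sqrt(2.0 * math.log(self.total_steps) / self.counts[name]),
        )

    def update(self, name, reward):
        # Incrementally update the empirical mean reward of the chosen dataset.
        self.counts[name] += 1
        self.total_steps += 1
        self.mean_rewards[name] += (reward - self.mean_rewards[name]) / self.counts[name]
```

Both sketches are bandit scaffolding only; gradient computation, mini-batch construction (8 samples per mini-batch with G accumulation steps, per the setup above), and checkpoint selection sit outside them.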