Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Authors: Alon Albalak, Colin A. Raffel, William Yang Wang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3. |
| Researcher Affiliation | Academia | Alon Albalak, University of California, Santa Barbara, alon_albalak@ucsb.edu; Colin Raffel, University of Toronto / Vector Institute, craffel@gmail.com; William Yang Wang, University of California, Santa Barbara, william@cs.ucsb.edu |
| Pseudocode | Yes | We include here pseudo-code for our 2 proposed algorithms. Algorithm 1 contains the pseudo-code for EXP3-FLAD, and Algorithm 2 contains the pseudo-code for UCB1-FLAD. |
| Open Source Code | Yes | All of our code is available at github.com/alon-albalak/FLAD. |
| Open Datasets | Yes | We obtain all datasets from Hugging Face Datasets1, and cast them to the text-to-text format by applying prompt templates from the Public Pool of Prompts (P3) [23] that was used to train T0. |
| Dataset Splits | Yes | For each dataset, we randomly sample five few-shot splits from their training data, containing the same number of training examples as previous works, between 20 and 70 [55, 56]. We further divide each split into equal training and validation partitions for true few-shot learning [57] (e.g. 10 train and 10 validation samples for HellaSwag). |
| Hardware Specification | Yes | We train all models (FLAD and non-FLAD) on 40GB A100s. |
| Software Dependencies | No | We used model checkpoints from Hugging Face Transformers [45]. For all experiments we use the Adafactor optimizer [58]. |
| Experiment Setup | Yes | For the target-only baseline, we use learning rates in {1e-4, 3e-4}. For all other methods, we always use a learning rate of 1e-4. For target-, explore-, and exploit-only baselines we use batch sizes in {32, 128}. For loss-scaling, EXP3-FLAD, and UCB1-FLAD we use mini-batches of 8 samples and let G be in {4, 16} to match the batch size of all methods. For explore- and exploit-only, we use a target dataset mixing ratio of M ∈ {1, 5, 10}. For all experiments we use the Adafactor optimizer [58] and validation-based early stopping for model checkpoint selection. |
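
The pseudocode row above points to the paper's EXP3-FLAD and UCB1-FLAD algorithms, which treat each auxiliary dataset as an arm of a multi-armed bandit. As a rough companion, the following is a minimal sketch of the classic EXP3 sampler that EXP3-FLAD builds on; the class name, the exploration rate, and the externally supplied reward are illustrative assumptions, while the authors' actual gradient-based reward and update schedule are given in their pseudocode and at github.com/alon-albalak/FLAD.

```python
import math
import random

class EXP3Sampler:
    """Minimal EXP3 bandit over K auxiliary datasets (one arm per dataset).

    Illustrative sketch only; not the authors' EXP3-FLAD implementation.
    """

    def __init__(self, num_datasets, gamma=0.1):
        self.K = num_datasets
        self.gamma = gamma                      # exploration rate
        self.weights = [1.0] * num_datasets     # one weight per auxiliary dataset

    def probabilities(self):
        # Mix the normalized weights with uniform exploration.
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.K
                for w in self.weights]

    def sample(self):
        # Choose which auxiliary dataset to draw the next mini-batch from.
        return random.choices(range(self.K), weights=self.probabilities(), k=1)[0]

    def update(self, arm, reward):
        # Importance-weighted exponential update for the chosen arm.
        # `reward` is assumed to lie in [0, 1]; in the paper it is derived from
        # the training signal rather than supplied directly like this.
        estimated = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimated / self.K)

# Toy usage: a handful of auxiliary datasets, one bandit step.
sampler = EXP3Sampler(num_datasets=8)
chosen = sampler.sample()
sampler.update(chosen, reward=0.7)   # reward value here is arbitrary
```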
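
The dataset-splits row describes sampling five few-shot splits per target dataset and dividing each into equal train/validation halves. Below is a minimal sketch of that procedure, assuming plain Python lists and hypothetical function names; the released code performs this via Hugging Face Datasets.

```python
import random

def make_few_shot_splits(train_examples, num_shots, num_splits=5):
    """Sample `num_splits` few-shot splits and halve each into train/validation.

    Illustrative only; mirrors the description quoted above (e.g. 20 sampled
    examples -> 10 train and 10 validation), not the repository's exact code.
    """
    splits = []
    for seed in range(num_splits):
        rng = random.Random(seed)                        # one seed per split
        sampled = rng.sample(list(train_examples), num_shots)
        half = num_shots // 2
        splits.append((sampled[:half], sampled[half:]))  # (train, validation)
    return splits

# Toy usage with placeholder examples.
toy_training_data = [f"example_{i}" for i in range(1000)]
few_shot_splits = make_few_shot_splits(toy_training_data, num_shots=20)
```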
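
For readability, the experiment-setup grid quoted above can be restated as a plain dictionary; the key names below are our own shorthand, not the repository's configuration format.

```python
# Hedged restatement of the hyperparameter search space quoted above.
# All methods use the Adafactor optimizer and validation-based early stopping.
SEARCH_SPACE = {
    "target_only":  {"learning_rate": [1e-4, 3e-4], "batch_size": [32, 128]},
    "explore_only": {"learning_rate": [1e-4], "batch_size": [32, 128],
                     "target_mixing_ratio_M": [1, 5, 10]},
    "exploit_only": {"learning_rate": [1e-4], "batch_size": [32, 128],
                     "target_mixing_ratio_M": [1, 5, 10]},
    "loss_scaling": {"learning_rate": [1e-4], "mini_batch_size": [8],
                     "gradient_accumulation_G": [4, 16]},
    "EXP3-FLAD":    {"learning_rate": [1e-4], "mini_batch_size": [8],
                     "gradient_accumulation_G": [4, 16]},
    "UCB1-FLAD":    {"learning_rate": [1e-4], "mini_batch_size": [8],
                     "gradient_accumulation_G": [4, 16]},
}
```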