Bidirectional Language Models Are Also Few-shot Learners

Authors: Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, Chris Callison-Burch

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization.
Researcher Affiliation | Collaboration | Ajay Patel (University of Pennsylvania); Bryan Li (University of Pennsylvania); Mohammad Sadegh Rasooli (Microsoft); Noah Constant (Google Research); Colin Raffel (UNC Chapel Hill); Chris Callison-Burch (University of Pennsylvania)
Pseudocode | No | The paper describes the SAP technique and bootstrapping process in detail with text and figures (Figure 1, Figure 2) but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | The paper lists resources (models and software) that were utilized, along with their GitHub links (e.g., 'mT5 (Xue et al., 2021): https://github.com/google-research/multilingual-t5/', 'Transformers (Wolf et al., 2019): https://github.com/huggingface/transformers'). However, it does not provide a link to the authors' own implementation of the methodology described in the paper, nor does it explicitly state that their code is being released.
Open Datasets | Yes | We evaluate our technique on 14 languages from the FLORES-101 dataset (Goyal et al., 2021) that span high-resource and low-resource languages. We ultimately will replace the examples in the few-shot prompt with synthetic parallel examples.
Dataset Splits | No | The paper refers to using the 'FLORES-101 devtest', 'SQuAD v1.1 dev set', and 'CNN / Daily Mail v3.0.0 test set', which are standard benchmark splits. However, it does not explicitly provide the specific percentages, sample counts, or methodology for how these splits were created, referring to them only by their common names. (A loading sketch for these splits appears after the table.)
Hardware Specification | Yes | We estimate the total compute budget and detail computing infrastructure used to run the computational experiments found in this paper below: 1x NVIDIA RTX A6000 / 87 GB RAM / 4x CPU, 686 hours.
Software Dependencies | No | The paper lists software used (e.g., 'Transformers (Wolf et al., 2019)', 'Datasets (Lhoest et al., 2021)', 'SacreBLEU (Post, 2018; Goyal et al., 2021)') but does not provide specific version numbers for these packages, which are necessary for reproducibility. (A version-recording sketch appears after the table.)
Experiment Setup | Yes | At each time step of iteration, we keep the first word generated (using the space character as delimiter) and concatenate it onto the last line of the prompt to use in the next time step. This iterative prompting enables us to extract longer generations. In our experiments, we run four bootstrapping rounds and sample 100 source language sentences from the training dataset in each round. We take the top two synthetic parallel examples with the highest mT5Score in the filtering step and use those as synthetic few-shot examples in the prompt in the self-amplification step. Unless otherwise stated, we use a 4-prompt ensemble in this paper: N=8. (Sketches of the iterative prompting and bootstrapping loops appear after the table.)
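
For concreteness, here is a minimal sketch of loading the three evaluation splits named in the Dataset Splits row with Hugging Face Datasets. The Hub identifiers are assumptions (the paper names the splits, not the loaders), and "gsarti/flores_101" is one community mirror of FLORES-101:

```python
from datasets import load_dataset

# Hub identifiers below are assumptions; the paper names splits, not loaders.
squad_dev = load_dataset("squad", split="validation")               # SQuAD v1.1 dev set
cnn_dm_test = load_dataset("cnn_dailymail", "3.0.0", split="test")  # CNN/DailyMail v3.0.0 test set
flores_devtest = load_dataset("gsarti/flores_101", "eng", split="devtest")  # community mirror
```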
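Because the paper omits version numbers for its dependencies, a reproduction should record the exact versions it actually ran with. A trivial sketch (any versions printed here are the reproducer's own, not the paper's):

```python
# Record exact library versions, since the paper does not report them.
import datasets
import sacrebleu
import transformers

for module in (transformers, datasets, sacrebleu):
    print(module.__name__, module.__version__)
```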
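The iterative prompting step quoted in the Experiment Setup row lends itself to a short sketch. The following is a hedged reconstruction with Hugging Face Transformers, not the authors' code: the checkpoint name, generation settings, and stopping rule are all assumptions (the paper's model is mT5; a smaller checkpoint is used here for illustration):

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Assumption: any mT5 checkpoint illustrates the mechanics.
model_name = "google/mt5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def iterative_generate(prompt: str, max_words: int = 40) -> str:
    """At each time step, ask mT5 to fill the <extra_id_0> sentinel, keep only
    the first generated word (space-delimited), and concatenate it onto the
    last line of the prompt for the next time step."""
    words = []
    for _ in range(max_words):
        inputs = tokenizer(prompt + " <extra_id_0>", return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=10)
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
        if not text:  # assumed stopping rule; the paper's criterion may differ
            break
        first_word = text.split(" ")[0]
        words.append(first_word)
        prompt = prompt + " " + first_word
    return " ".join(words)
```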
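The bootstrapping and self-amplification loop can be sketched the same way. Here `translate_fn` and `score_fn` are hypothetical stand-ins for prompted translation and the paper's mT5Score filter; only the loop structure and the quoted numbers (four rounds, 100 sampled sentences, top two pairs kept) come from the paper:

```python
import random

def bootstrap_few_shot(source_sentences, translate_fn, score_fn,
                       rounds=4, sample_size=100, keep_top=2):
    """Each round: translate a fresh sample of source sentences with the
    current synthetic few-shot examples, score the resulting synthetic
    parallel pairs, and keep the highest-scoring pairs as the few-shot
    examples for the next round (self-amplification)."""
    few_shot = []  # the first round is effectively zero-shot
    for _ in range(rounds):
        sample = random.sample(source_sentences, sample_size)
        pairs = [(src, translate_fn(src, few_shot)) for src in sample]
        pairs.sort(key=lambda p: score_fn(*p), reverse=True)  # mT5Score stand-in
        few_shot = pairs[:keep_top]
    return few_shot
```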