Bidirectional Language Models Are Also Few-shot Learners
Authors: Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, Chris Callison-Burch
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. |
| Researcher Affiliation | Collaboration | Ajay Patel (University of Pennsylvania); Bryan Li (University of Pennsylvania); Mohammad Sadegh Rasooli (Microsoft); Noah Constant (Google Research); Colin Raffel (UNC Chapel Hill); Chris Callison-Burch (University of Pennsylvania) |
| Pseudocode | No | The paper describes the SAP technique and bootstrapping process in detail with text and figures (Figure 1, Figure 2) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper lists resources (models and software) that were utilized, along with their GitHub links (e.g., 'mT5 (Xue et al., 2021): https://github.com/google-research/multilingual-t5/', 'Transformers (Wolf et al., 2019): https://github.com/huggingface/transformers'). However, it does not provide a link to the authors' own implementation code for the methodology described in the paper, nor does it explicitly state that their code is being released. |
| Open Datasets | Yes | We evaluate our technique on 14 languages from the FLORES-101 dataset (Goyal et al., 2021) that span high-resource and low-resource languages. We ultimately will replace the examples in the few-shot prompt with synthetic parallel examples. |
| Dataset Splits | No | The paper refers to using 'FLORES-101 devtest', 'SQuAD v1.1 dev set', and 'CNN/Daily Mail v3.0.0 test set', which are standard benchmark splits. However, it does not explicitly provide the specific percentages, sample counts, or detailed methodology for how these dataset splits were created, only referring to them by their common names. |
| Hardware Specification | Yes | We estimate the total compute budget and detail computing infrastructure used to run the computational experiments found in this paper below: 1x NVIDIA RTX A6000 / 87 GB RAM / 4x CPU; 686 hours |
| Software Dependencies | No | The paper lists software used (e.g., 'Transformers (Wolf et al., 2019)', 'Datasets (Lhoest et al., 2021)', 'SacreBLEU (Post, 2018; Goyal et al., 2021)'), but does not provide specific version numbers for these software packages, which is necessary for reproducibility. (A hedged SacreBLEU scoring sketch appears after the table.) |
| Experiment Setup | Yes | At each time step of iteration, we keep the first word generated (using the space character as delimiter) and concatenate it into the last line of the prompt to use in the next time step. This iterative prompting enables us to extract longer generations. In our experiments, we run four bootstrapping rounds and sample 100 source language sentences from the training dataset in each round. We take the top two synthetic parallel examples with the highest mT5Score in the filtering step and use those as synthetic few-shot examples in the prompt in the self-amplification step. Unless otherwise stated, we use a 4-shot prompt ensemble in this paper: N=8. (A minimal sketch of the iterative prompting loop appears below the table.) |
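To make the quoted setup concrete, here is a minimal Python sketch of SAP-style iterative few-shot prompting with mT5 via Hugging Face Transformers. The checkpoint name (`google/mt5-xl`), the prompt template, the sentinel-token usage, and the 32-step cap are all illustrative assumptions, not the authors' implementation (the paper releases no code); the bootstrapping and mT5Score filtering steps are omitted.

```python
# Hedged sketch of SAP-style iterative few-shot prompting with mT5.
# Checkpoint, prompt template, sentinel usage, and stopping cap are
# illustrative assumptions, not the authors' released code.
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

checkpoint = "google/mt5-xl"  # assumed model size; the paper uses mT5
tokenizer = MT5Tokenizer.from_pretrained(checkpoint)
model = MT5ForConditionalGeneration.from_pretrained(checkpoint)

# Few-shot prompt: parallel examples followed by the source sentence stub.
few_shot = (
    "French: Bonjour le monde. English: Hello world. "
    "French: Merci beaucoup. English: Thank you very much. "
)
stub = "French: Comment allez-vous ? English:"

# mT5 is pretrained with span corruption, so we append a sentinel token and
# ask the model to fill it in; at each step we keep only the first generated
# word (space-delimited) and concatenate it back into the prompt.
translation = ""
for _ in range(32):  # assumed cap on translation length
    prompt = f"{few_shot}{stub}{translation} <extra_id_0>"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=8)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    if not text:
        break  # nothing more to generate
    translation += " " + text.split(" ")[0]

print(translation.strip())
```

Keeping only the first space-delimited word per step mirrors the quoted procedure: mT5's span-corruption pretraining tends to yield short infills, so longer generations are grown one word at a time.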
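Likewise, since the paper cites SacreBLEU together with the FLORES-101 spBLEU tokenizer (Goyal et al., 2021) but pins no versions, the following hedged snippet shows one plausible way the reported scores could be computed; the sacrebleu>=2.0 API and the `tokenize="flores101"` setting are assumptions.

```python
# Hedged sketch: corpus-level spBLEU with SacreBLEU. The "flores101"
# tokenizer is assumed from the paper's citation of Goyal et al. (2021)
# and requires sacrebleu >= 2.0 with sentencepiece installed.
import sacrebleu

hypotheses = ["Hello world.", "Thank you very much."]  # system outputs
references = [["Hello, world.", "Thanks a lot."]]      # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(f"spBLEU = {bleu.score:.2f}")
```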