CausalLM is not optimal for in-context learning
Authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments verify that causal LM consistently underperforms prefix LM in all settings. ...Our experiments contain three parts. |
| Researcher Affiliation | Industry | Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut, Google Research, {dingnan,tomerl,jialinwu,seabass,rsoricut}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions the 'publicly available T5 family of models (Roberts et al., 2022)' and 'https://github.com/google-research/t5x', but these refer to a framework used by the authors, not to the source code for their own methodology or experiments. |
| Open Datasets | Yes (see the attention-mask sketch after the table) | Note that the existing public T5X checkpoints are all based on EncDec models, which are similar to prefix LM. Thus, it would be unfair and unnatural to compare with causal LM by simply replacing the bidirectional attention of the encoder with causal attention during the finetuning stage. To make a more reasonable comparison, we reran the pretraining stages of T5 on the C4 corpus (Raffel et al., 2020a) from a random initialization point using a span corruption objective, but in the Decoder Only setting. |
| Dataset Splits | Yes | The ICL training dataset contains 64,000 training sequences. Each sequence contains 40 in-context examples and 20 queries, where queries are independent of each other, similar to Section 5.1. The transformers are trained with batch size 64 for 100 epochs. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments. |
| Software Dependencies | No | The paper mentions 'FLAN recipe (Chung et al., 2022)' and 'T5 family of models (Roberts et al., 2022)' but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes (see the configuration sketch after the table) | We trained a few 24-layer transformers containing 128 hidden units and 2 heads. ...The transformers are trained with batch size 64 for 100 epochs. More details of the hyper-parameters of the experiments are provided in Appendix E. |
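
For readers unfamiliar with the prefix LM vs. causal LM distinction referenced in the Open Datasets row, the following is a minimal sketch of how the two attention masks differ. It is an illustration only, not the authors' code; the function names and the NumPy representation are our own.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Causal LM: each position attends only to itself and earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Prefix LM: the first `prefix_len` tokens attend bidirectionally among
    themselves (encoder-like); the remaining tokens attend causally."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

# Example: a sequence of 6 tokens whose first 3 tokens form the prefix
# (e.g. the in-context examples), followed by 3 causally decoded tokens.
print(causal_mask(6).astype(int))
print(prefix_lm_mask(6, prefix_len=3).astype(int))
```

Swapping `prefix_lm_mask` for `causal_mask` over the in-context portion of the sequence is, in essence, the architectural change whose effect on in-context learning the paper measures.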
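
The hyper-parameters quoted in the Dataset Splits and Experiment Setup rows can be collected into a single configuration sketch. The field names below are hypothetical (the authors' experiment code is not released); only the numeric values come from the paper.

```python
# Illustrative configuration for the synthetic ICL experiment; field names are
# our own, the numbers are those reported in the table above.
ICL_EXPERIMENT_CONFIG = {
    "model": {
        "num_layers": 24,
        "hidden_size": 128,
        "num_heads": 2,
    },
    "data": {
        "num_train_sequences": 64_000,
        "in_context_examples_per_sequence": 40,
        "queries_per_sequence": 20,  # queries are scored independently of each other
    },
    "training": {
        "batch_size": 64,
        "num_epochs": 100,
    },
}

steps_per_epoch = (ICL_EXPERIMENT_CONFIG["data"]["num_train_sequences"]
                   // ICL_EXPERIMENT_CONFIG["training"]["batch_size"])
print(f"{steps_per_epoch} optimizer steps per epoch")  # 1000
```

Note that hardware and software-version details are not recoverable from the paper, so they are deliberately absent from this sketch.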