CausalLM is not optimal for in-context learning

Authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments verify that causal LM consistently underperforms prefix LM in all settings. ... Our experiments contain three parts."
Researcher Affiliation | Industry | Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut (Google Research; {dingnan,tomerl,jialinwu,seabass,rsoricut}@google.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions the 'publicly available T5 family of models (Roberts et al., 2022)' and 'https://github.com/google-research/t5x', but these refer to a framework used by the authors, not to released source code for their own methodology or experiments.
Open Datasets | Yes | "Note that the existing public T5X checkpoints are all based on EncDec models, which are similar to prefix LM. Thus, it would be unfair and unnatural to compare with causal LM by simply replacing the bidirectional attention of the encoder with causal attention during the finetuning stage. To make a more reasonable comparison, we reran the pretraining stages of T5 on the C4 corpus (Raffel et al., 2020a) from a random initialization point using a span corruption objective, but in the decoder-only setting." (See the attention-mask sketch after the table.)
Dataset Splits | Yes | "The ICL training dataset contains 64,000 training sequences. Each sequence contains 40 in-context examples and 20 queries, where queries are independent of each other, similar to Section 5.1. The transformers are trained with batch size 64 for 100 epochs." (See the data-layout sketch after the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions the 'FLAN recipe (Chung et al., 2022)' and the 'T5 family of models (Roberts et al., 2022)' but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | "We trained a few 24-layer transformers containing 128 hidden units and 2 heads. ... The transformers are trained with batch size 64 for 100 epochs. More details of the hyper-parameters of the experiments are provided in Appendix E." (See the configuration sketch after the table.)
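
The "Open Datasets" row turns on the distinction between causal and prefix-LM attention. As a minimal illustration (written for this summary, not taken from the authors' T5X code; the sequence and prefix lengths below are arbitrary), the two mask patterns can be sketched as:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Causal LM: each position attends only to itself and earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Prefix LM: tokens inside the prefix (the in-context examples) attend
    bidirectionally; tokens after the prefix remain causally masked."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix block
    return mask

# Example: 6 tokens, of which the first 4 form the prefix.
print(causal_mask(6).astype(int))
print(prefix_lm_mask(6, prefix_len=4).astype(int))
```

The bidirectional prefix block is what makes the public EncDec checkpoints "similar to prefix LM" in the quoted passage, while a decoder-only causal LM uses the purely lower-triangular mask.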
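The "Dataset Splits" row describes sequences of 40 in-context examples followed by 20 mutually independent queries, with 64,000 such sequences in total. A hypothetical sketch of that layout follows; the linear-regression task, the input dimensionality, and all function names are assumptions for illustration, not the authors' data pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SEQUENCES = 64_000  # training sequences quoted in the paper
NUM_EXAMPLES = 40       # in-context examples per sequence
NUM_QUERIES = 20        # independent queries per sequence
DIM = 16                # assumed input dimensionality (not stated in the excerpt)

def make_sequence() -> tuple[np.ndarray, np.ndarray]:
    """Build one sequence: 40 (x, y) demonstrations plus 20 query points that
    share the same task vector but are otherwise independent draws."""
    w = rng.normal(size=DIM)                                # task shared within the sequence
    x = rng.normal(size=(NUM_EXAMPLES + NUM_QUERIES, DIM))
    y = x @ w                                               # assumed noiseless linear targets
    return x, y

x, y = make_sequence()
print(x.shape, y.shape)  # (60, 16) (60,)
```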
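Finally, the hyper-parameters quoted in the "Experiment Setup" and "Dataset Splits" rows can be collected into a single configuration sketch. The field names are chosen here for readability and are not the authors' T5X/gin configuration; unreported values (optimizer, learning rate, etc.) are deliberately left out rather than guessed.

```python
# Values quoted from the paper; everything else is omitted.
TRANSFORMER_CONFIG = {
    "num_layers": 24,    # "24-layer transformers"
    "hidden_size": 128,  # "128 hidden units"
    "num_heads": 2,      # "2 heads"
    "batch_size": 64,    # "trained with batch size 64"
    "num_epochs": 100,   # "for 100 epochs"
}
```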