Pre-training Text-to-Text Transformers for Concept-centric Common Sense
Authors: Wangchunshu Zhou, Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, Bill Yuchen Lin, Xiang Ren
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that our method, Concept-Aware Language Model (CALM), can pack more commonsense knowledge into the parameters of a pre-trained text-to-text transformer without relying on external knowledge graphs, yielding better performance on both NLU and NLG tasks. We show that while only incrementally pre-trained on a relatively small corpus for a few steps, CALM outperforms baseline methods by a consistent margin and is even comparable to some larger PTLMs, which suggests that CALM can serve as a general, plug-and-play method for improving the commonsense reasoning ability of a PTLM. |
| Researcher Affiliation | Academia | Wangchunshu Zhou (Beihang University); Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, Bill Yuchen Lin, Xiang Ren (University of Southern California). zhouwangchunshu@buaa.edu.cn, {dongho.lee, xiangren}@usc.edu |
| Pseudocode | Yes | Algorithm 1: Pre-training Concept-Aware Language Model (CALM). Input: text-to-text transformer T_θ, text corpus X = [x_1, x_2, ..., x_n]. Repeat: for each x_i ∈ X, extract the concept-set C_i; construct the distractor sentence x̃_i = CONCEPT-PERMUTE(x_i, C_i); update T_θ with Eq. (1, 2, 4); until maximum iterations reached. Then repeat: for each x_i ∈ X, update T_θ with Eq. (7); until maximum iterations reached. (A minimal Python sketch of this loop follows the table below.) |
| Open Source Code | No | Code will be published at: https://github.com/INK-USC/CALM |
| Open Datasets | Yes | We randomly sample 500K sentences from the English Wikipedia corpus (https://dumps.wikimedia.org/enwiki/latest/), which is used for pre-training BERT and its variants, as the source dataset for our proposed self-supervised objectives, which serve as intermediate tasks. ... We consider five commonsense benchmark datasets as target tasks: CommonsenseQA (Talmor et al., 2018), OpenbookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), aNLI (Bhagavatula et al., 2019), and one dataset for a generative task: CommonGEN (Lin et al., 2020). |
| Dataset Splits | Yes | We consider five commonsense benchmark datasets as target tasks. ... Details on datasets are discussed in Appendix A.3. ... Table 8 (properties of commonsense benchmark datasets; train / development / test sizes): CommonsenseQA 9,741 / 1,221 / 1,140; OpenbookQA 4,957 / 500 / 500; PIQA 16,113 / 1,838 / 3,084; aNLI 169,654 / 1,532 / 3,040; CommonGEN 67,389 / 4,018 / 6,042. ... We tune the hyperparameters based on the model's performance on an in-house dev split. |
| Hardware Specification | Yes | We train the models with 8 V100 GPUs and FP32 precision for 17 hours. ... For fine-tuning, we use 4 V100 GPUs and use FP32. |
| Software Dependencies | No | We implement our pre-trained models using PyTorch Lightning (Falcon, 2019) and Huggingface's PyTorch Transformers (Wolf et al., 2019). While these software components are mentioned with their respective authors and publication years, specific version numbers for PyTorch Lightning and Huggingface's PyTorch Transformers are not provided, which prevents full reproducibility. |
| Experiment Setup | Yes | For the pre-training phase, we use the Adam optimizer with maximum sequence length 256, train batch size 8, gradient accumulation 8, warmup steps 10000, weight decay 0.01, and Adam epsilon 1e-6. ... For fine-tuning, we use 4 V100 GPUs with FP32. For all discriminative tasks, we use the Adam optimizer with maximum sequence length 256, batch size 4, and gradient accumulation 16. For the generative task, we use the Adam optimizer with maximum source length 32, maximum target length 32, batch size 8, and gradient accumulation 16. For all tasks, we use warmup fraction 0.01. Learning rates and train epochs are listed in Table 7. (A hedged optimizer-configuration sketch follows the table below.) |
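To make the control flow of Algorithm 1 easier to follow, here is a minimal Python sketch of the two-stage pre-training loop. This is not the authors' implementation: `extract_concepts`, `concept_permute`, and the `model.update_with_*` methods are hypothetical placeholders standing in for the paper's concept extraction, the CONCEPT-PERMUTE step, and the objectives referred to as Eq. (1, 2, 4) and Eq. (7).

```python
# Sketch of Algorithm 1 (CALM pre-training loop); helpers and model methods are
# hypothetical placeholders, not the authors' code.
import random
from typing import List


def extract_concepts(sentence: str) -> List[str]:
    # Placeholder concept extraction; the paper builds a concept-set per sentence.
    return [w for w in sentence.split() if len(w) > 4]


def concept_permute(sentence: str, concepts: List[str]) -> str:
    # Build a distractor sentence by permuting the concept words in place.
    shuffled = concepts[:]
    random.shuffle(shuffled)
    mapping = dict(zip(concepts, shuffled))
    return " ".join(mapping.get(w, w) for w in sentence.split())


def pretrain_calm(model, corpus: List[str], stage1_iters: int, stage2_iters: int):
    # Stage 1: update with the concept-centric objectives (Eq. 1, 2, 4 in the paper)
    # on (original sentence, distractor sentence) pairs.
    for _ in range(stage1_iters):
        for x in corpus:
            concepts = extract_concepts(x)
            distractor = concept_permute(x, concepts)
            model.update_with_objectives_1_2_4(x, distractor)  # hypothetical method
    # Stage 2: update with the joint objective (Eq. 7 in the paper).
    for _ in range(stage2_iters):
        for x in corpus:
            model.update_with_objective_7(x)  # hypothetical method
```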
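The reported pre-training hyperparameters can also be read as a small configuration sketch. The snippet below is an assumption-laden approximation: it assumes a Hugging Face T5 backbone, uses `torch.optim.AdamW` to stand in for the reported "Adam with weight decay 0.01", and picks a linear warmup schedule, which the quoted setup does not specify. The learning rate and total step count are left as arguments because the paper defers them to its Table 7.

```python
# Hedged sketch of the reported pre-training setup (max seq length 256,
# batch size 8, gradient accumulation 8, warmup 10000, weight decay 0.01,
# epsilon 1e-6). Backbone choice and schedule type are assumptions.
import torch
from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup


def build_pretraining_setup(learning_rate: float, num_training_steps: int):
    model = T5ForConditionalGeneration.from_pretrained("t5-base")  # assumed backbone
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,   # actual values are listed in the paper's Table 7
        eps=1e-6,           # Adam epsilon reported in the setup
        weight_decay=0.01,  # weight decay reported in the setup
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=10_000,  # warmup steps reported in the setup
        num_training_steps=num_training_steps,
    )
    config = {
        "max_seq_length": 256,
        "train_batch_size": 8,
        "gradient_accumulation_steps": 8,
    }
    return model, optimizer, scheduler, config
```

Note that the fine-tuning settings quoted above differ (batch size 4 and gradient accumulation 16 for discriminative tasks; source/target length 32 for the generative task), so a separate configuration would be needed for that phase.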