Pre-training Text-to-Text Transformers for Concept-centric Common Sense
Authors: Wangchunshu Zhou, Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, Bill Yuchen Lin, Xiang Ren
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that our method, Concept-Aware Language Model (CALM), can pack more commonsense knowledge into the parameters of a pre-trained text-to-text transformer without relying on external knowledge graphs, yielding better performance on both NLU and NLG tasks. We show that while only incrementally pre-trained on a relatively small corpus for a few steps, CALM outperforms baseline methods by a consistent margin and is even comparable to some larger PTLMs, which suggests that CALM can serve as a general, plug-and-play method for improving the commonsense reasoning ability of a PTLM. |
| Researcher Affiliation | Academia | Wangchunshu Zhou (Beihang University); Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, Bill Yuchen Lin, Xiang Ren (University of Southern California). zhouwangchunshu@buaa.edu.cn, {dongho.lee, xiangren}@usc.edu |
| Pseudocode | Yes | Algorithm 1: Pre-training Concept-Aware Language Model (CALM). Input: text-to-text transformer T_θ, text corpus X = [x_1, x_2, ..., x_n]. Repeat: for each x_i ∈ X, extract the concept-set C_i; construct the distractor sentence x̃_i = CONCEPT-PERMUTE(x_i, C_i); update T_θ with Eq. (1, 2, 4); until maximum iterations reached. Then repeat: for each x_i ∈ X, update T_θ with Eq. (7); until maximum iterations reached. (A minimal Python sketch of this loop follows the table below.) |
| Open Source Code | No | Code will be published at: https://github.com/INK-USC/CALM |
| Open Datasets | Yes | We randomly sample 500K sentences from the English Wikipedia corpus (https://dumps.wikimedia.org/enwiki/latest/), which is used for pre-training BERT and its variants, as the source dataset for our proposed self-supervised objectives, which serve as intermediate tasks. ... We consider five commonsense benchmark datasets as target tasks: CommonsenseQA (Talmor et al., 2018), OpenbookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), aNLI (Bhagavatula et al., 2019), and one dataset for a generative task: CommonGEN (Lin et al., 2020). |
| Dataset Splits | Yes | We consider five commonsense benchmark datasets as target tasks. ... Details on datasets are discussed in Appendix A.3. ... Table 8 (properties of commonsense benchmark datasets; train / development / test sizes): CommonsenseQA 9,741 / 1,221 / 1,140; OpenbookQA 4,957 / 500 / 500; PIQA 16,113 / 1,838 / 3,084; aNLI 169,654 / 1,532 / 3,040; CommonGEN 67,389 / 4,018 / 6,042. ... We tune the hyperparameters based on the model's performance on an in-house dev split. |
| Hardware Specification | Yes | We train the models with 8 V100 GPUs and FP32 precision for 17 hours. ... For fine-tuning, we use 4 V100 GPUs and use FP32. |
| Software Dependencies | No | We implement our pre-trained models using PyTorch Lightning (Falcon, 2019) and Huggingface's PyTorch Transformers (Wolf et al., 2019). While these software components are mentioned with their respective authors and publication years, specific version numbers for PyTorch Lightning and Huggingface's PyTorch Transformers are not provided, which prevents full reproducibility. |
| Experiment Setup | Yes | For the pre-training phase, we use the Adam optimizer with maximum sequence length 256, train batch size 8, gradient accumulation 8, warmup steps 10000, weight decay 0.01, and Adam epsilon 1e-6. ... For fine-tuning, we use 4 V100 GPUs with FP32. For all discriminative tasks, we use the Adam optimizer with maximum sequence length 256, batch size 4, and gradient accumulation 16. For the generative task, we use the Adam optimizer with maximum source length 32, maximum target length 32, batch size 8, and gradient accumulation 16. For all tasks, we use warmup fraction 0.01. Learning rates and train epochs are listed in Table 7. (A hedged optimizer-configuration sketch follows the table below.) |
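To make the control flow of Algorithm 1 easier to follow, here is a minimal Python sketch of the two-stage pre-training loop. This is not the authors' implementation: `extract_concepts`, `concept_permute`, and the `model.update_with_*` methods are hypothetical placeholders standing in for the paper's concept extraction, the CONCEPT-PERMUTE step, and the objectives referred to as Eq. (1, 2, 4) and Eq. (7).

```python
# Sketch of Algorithm 1 (CALM pre-training loop); helpers and model methods are
# hypothetical placeholders, not the authors' code.
import random
from typing import List


def extract_concepts(sentence: str) -> List[str]:
    # Placeholder concept extraction; the paper builds a concept-set per sentence.
    return [w for w in sentence.split() if len(w) > 4]


def concept_permute(sentence: str, concepts: List[str]) -> str:
    # Build a distractor sentence by permuting the concept words in place.
    shuffled = concepts[:]
    random.shuffle(shuffled)
    mapping = dict(zip(concepts, shuffled))
    return " ".join(mapping.get(w, w) for w in sentence.split())


def pretrain_calm(model, corpus: List[str], stage1_iters: int, stage2_iters: int):
    # Stage 1: update with the concept-centric objectives (Eq. 1, 2, 4 in the paper)
    # on (original sentence, distractor sentence) pairs.
    for _ in range(stage1_iters):
        for x in corpus:
            concepts = extract_concepts(x)
            distractor = concept_permute(x, concepts)
            model.update_with_objectives_1_2_4(x, distractor)  # hypothetical method
    # Stage 2: update with the joint objective (Eq. 7 in the paper).
    for _ in range(stage2_iters):
        for x in corpus:
            model.update_with_objective_7(x)  # hypothetical method
```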
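The reported pre-training hyperparameters can also be read as a small configuration sketch. The snippet below is an assumption-laden approximation: it assumes a Hugging Face T5 backbone, uses `torch.optim.AdamW` to stand in for the reported "Adam with weight decay 0.01", and picks a linear warmup schedule, which the quoted setup does not specify. The learning rate and total step count are left as arguments because the paper defers them to its Table 7.

```python
# Hedged sketch of the reported pre-training setup (max seq length 256,
# batch size 8, gradient accumulation 8, warmup 10000, weight decay 0.01,
# epsilon 1e-6). Backbone choice and schedule type are assumptions.
import torch
from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup


def build_pretraining_setup(learning_rate: float, num_training_steps: int):
    model = T5ForConditionalGeneration.from_pretrained("t5-base")  # assumed backbone
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,   # actual values are listed in the paper's Table 7
        eps=1e-6,           # Adam epsilon reported in the setup
        weight_decay=0.01,  # weight decay reported in the setup
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=10_000,  # warmup steps reported in the setup
        num_training_steps=num_training_steps,
    )
    config = {
        "max_seq_length": 256,
        "train_batch_size": 8,
        "gradient_accumulation_steps": 8,
    }
    return model, optimizer, scheduler, config
```

Note that the fine-tuning settings quoted above differ (batch size 4 and gradient accumulation 16 for discriminative tasks; source/target length 32 for the generative task), so a separate configuration would be needed for that phase.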