Language Model Pre-training on True Negatives
Authors: Zhuosheng Zhang, Hai Zhao, Masao Utiyama, Eiichiro Sumita
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on GLUE and SQuAD benchmarks show that our counter-false-negative pre-training methods indeed bring about better performance together with stronger robustness. |
| Researcher Affiliation | Academia | (1) Department of Computer Science and Engineering, Shanghai Jiao Tong University; (2) Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China; (3) National Institute of Information and Communications Technology (NICT), Kyoto, Japan |
| Pseudocode | No | The paper describes methods with mathematical formulas and text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | We use the wikitext-2-raw-v1 corpus (Merity et al. 2017) for validation. We use OpenWebText (Radford et al. 2019) to train small models, and Wikipedia and BooksCorpus (Zhu et al. 2015) for training base models following (Clark et al. 2020). (See the dataset-loading sketch after this table.) |
| Dataset Splits | Yes | For evaluation, we fine-tune the pre-trained models on GLUE (General Language Understanding Evaluation) (Wang et al. 2019) and SQuAD v1.1 (Rajpurkar et al. 2016) to evaluate the performance of the pre-trained models. [...] Table 3: Comparisons between our proposed methods and the baseline pre-trained models on the dev set of GLUE tasks. and Table 4: Results on the SQuAD dev set. |
| Hardware Specification | Yes | Please note that it is inadequate to pursue absolute gains for large models by using single-machine NVIDIA V100 GPUs (e.g., slower convergence speed with much smaller batch sizes), compared with TPUs for training large models in public releases (Devlin et al. 2019). |
| Software Dependencies | No | The paper mentions using ELECTRA, BERT, WordNet, and Word2Vec but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For hyper-parameters, the batch size is 128 for the base models in our work instead of 256 as in the original setting due to limited resources. The mask ratio is 15%. We set a maximum number of tokens as 128 for small models and 512 for base models. [...] The learning rates for small and base models are 5e-4, and 5e-5, respectively. (See the configuration sketch below.) |
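
The hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration record. The sketch below is not the authors' code (which is unreleased); the class name `PretrainConfig` and its field names are illustrative assumptions, and only values reported in the excerpt are filled in.

```python
# Minimal sketch of the reported pre-training hyper-parameters as a config
# object. `PretrainConfig` and its field names are assumptions, not the
# authors' (unreleased) code; only values quoted in the paper are set.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PretrainConfig:
    model_size: str                   # "small" or "base"
    mask_ratio: float                 # fraction of input tokens that are masked
    max_seq_length: int               # maximum number of tokens per sequence
    learning_rate: float
    batch_size: Optional[int] = None  # only reported for base models (128 here vs. 256 originally)


SMALL_CONFIG = PretrainConfig("small", mask_ratio=0.15, max_seq_length=128,
                              learning_rate=5e-4)
BASE_CONFIG = PretrainConfig("base", mask_ratio=0.15, max_seq_length=512,
                             learning_rate=5e-5, batch_size=128)
```

All corpora cited in the Open Datasets row are publicly available. The loading sketch below uses the Hugging Face `datasets` library; the hub identifiers (`wikitext`, `openwebtext`, `wikipedia`, `bookcorpus`) point to community-hosted copies and may differ from the exact snapshots the authors used.

```python
# Minimal sketch: loading the public corpora named in the paper via the
# Hugging Face `datasets` library. These hub identifiers are community
# mirrors, not releases by the authors, so snapshots may differ.
# Newer `datasets` versions may require trust_remote_code=True for
# script-based datasets such as openwebtext and bookcorpus.
from datasets import load_dataset

# Validation corpus: wikitext-2-raw-v1 (Merity et al. 2017).
wikitext_val = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")

# Pre-training corpora: OpenWebText for small models;
# English Wikipedia and BooksCorpus for base models.
openwebtext = load_dataset("openwebtext", split="train")
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")

print(wikitext_val)
```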
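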