CogLTX: Applying BERT to Long Texts

Authors: Ming Ding, Chang Zhou, Hongxia Yang, Jie Tang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on four long-text datasets with different tasks. The boxplot in Figure 5 illustrates the statistics of the text length in the datasets. Our experiments demonstrate that CogLTX outperforms or achieves comparable performance with the state-of-the-art results on four tasks, including NewsQA [44], HotpotQA [53], 20NewsGroups [22] and Alibaba, with constant memory consumption regardless of the length of text.
Researcher Affiliation | Collaboration | Ming Ding (Tsinghua University, dm18@mails.tsinghua.edu.cn); Chang Zhou (Alibaba Group, ericzhou.zc@alibaba-inc.com); Hongxia Yang (Alibaba Group, yang.yhx@alibaba-inc.com); Jie Tang (Tsinghua University, jietang@tsinghua.edu.cn)
Pseudocode | Yes | Algorithm 1: The Training Algorithm of CogLTX (a hypothetical sketch of this loop follows the table)
Open Source Code | Yes | Codes are available at https://github.com/Sleepychord/CogLTX.
Open Datasets | Yes | We conducted experiments on four long-text datasets with different tasks. The boxplot in Figure 5 illustrates the statistics of the text length in the datasets. NewsQA [44]: A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200, 2017. HotpotQA [53]: Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369-2380, 2018. 20NewsGroups [22]: K. Lang. NewsWeeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331-339, 1995.
Dataset Splits | Yes | Table 2: Results on HotpotQA distractor (dev).
Hardware Specification | Yes | The data about memory are measured with batch size = 1 on a Tesla V100 (a measurement sketch follows the table).
Software Dependencies | No | The paper mentions using Adam [18] for finetuning but does not specify version numbers for any key software components, libraries, or programming languages.
Experiment Setup | Yes | In all experiments, the judge and reasoner are finetuned by Adam [18] with learning rates 4×10^-5 and 10^-4 respectively. The learning rates warm up over the first 10% of steps, and then linearly decay to 1/10 of the max learning rates. The common hyperparameters are batch size = 32, strides = [3, 5], t_up = 0.2 and t_down = 0.05. (A configuration sketch follows the table.)
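
The Pseudocode row refers to Algorithm 1, which alternates between retrieving key blocks with the judge (MemRecall) and finetuning the judge and the reasoner on the retrieved short sequence. Below is a minimal, hypothetical Python sketch of that loop; `mem_recall`, `judge`, `reasoner`, and the block/sample objects are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the CogLTX training loop (cf. Algorithm 1).
# `judge.score`, `judge.loss`, `reasoner.loss` and the block objects are
# assumed interfaces for illustration only, not the authors' API.
import torch

def mem_recall(judge, query, blocks, capacity=512):
    """Greedy retrieval sketch: keep the highest-scoring blocks until the
    concatenated key sequence z reaches the BERT length limit."""
    scores = judge.score(query, blocks)            # one relevance score per block
    z, used = [], 0
    for idx in torch.argsort(scores, descending=True).tolist():
        block = blocks[idx]
        if used + len(block.tokens) <= capacity:
            z.append(block)
            used += len(block.tokens)
    return z

def train_step(sample, judge, reasoner, opt_judge, opt_reasoner):
    # 1) Retrieve the key blocks z for this long text.
    z = mem_recall(judge, sample.query, sample.blocks)

    # 2) Update the judge with block-level relevance supervision
    #    (gold spans or intervention-derived labels in the paper).
    loss_j = judge.loss(sample.query, z, sample.relevance_labels)
    opt_judge.zero_grad(); loss_j.backward(); opt_judge.step()

    # 3) Update the reasoner on the short retrieved sequence z only,
    #    so memory stays constant regardless of the original text length.
    loss_r = reasoner.loss(sample.query, z, sample.label)
    opt_reasoner.zero_grad(); loss_r.backward(); opt_reasoner.step()
```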
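
The Experiment Setup row describes the optimization schedule (Adam, linear warmup over the first 10% of steps, then linear decay to 1/10 of the peak learning rate). A minimal sketch of that schedule in PyTorch, assuming generic model modules and a known total step count:

```python
# Sketch of the reported optimization setup; the model objects are placeholders.
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model, max_lr, total_steps, warmup_frac=0.1, floor=0.1):
    opt = torch.optim.Adam(model.parameters(), lr=max_lr)
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                    # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 1.0 - (1.0 - floor) * progress                     # decay to 1/10 of max_lr

    return opt, LambdaLR(opt, lr_lambda)

# Paper values: judge LR 4e-5, reasoner LR 1e-4, batch size 32.
# opt_judge, sched_judge = make_optimizer(judge, 4e-5, total_steps)
# opt_reasoner, sched_reasoner = make_optimizer(reasoner, 1e-4, total_steps)
```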
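
The Hardware Specification row reports memory measured with batch size = 1 on a Tesla V100. One way to reproduce such a peak-memory measurement with PyTorch's CUDA counters, with `model` and `inputs` as placeholders (the paper does not publish this script):

```python
# Illustrative peak-memory measurement for a single example (batch size = 1).
import torch

torch.cuda.reset_peak_memory_stats()
outputs = model(**inputs)            # one forward pass with batch size 1
loss = outputs.loss                  # assumed attribute; adapt to the task head
loss.backward()                      # include backward to capture training memory
peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak GPU memory: {peak_gib:.2f} GiB")
```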