Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
***FastDoc***: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy
Authors: Abhilash Nandy, Manav Nitin Kapadnis, Sohan Patnaik, Yash Parag Butala, Pawan Goyal, Niloy Ganguly
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform such domain-specific pre-training on three different domains namely customer support, scientific, and legal domains, and compare performance on 6 different downstream tasks and 9 different datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to MLM and/or NSP in Customer Support, Scientific, and Legal Domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that FastDoc either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal Domains. Moreover, reduced training aids in mitigating the risk of catastrophic forgetting. |
| Researcher Affiliation | Academia | Abhilash Nandy (EMAIL), Department of Computer Science, Indian Institute of Technology Kharagpur; Manav Nitin Kapadnis (EMAIL), School of Computer Science, Carnegie Mellon University; Sohan Patnaik (EMAIL), Department of Mechanical Engineering, Indian Institute of Technology Kharagpur; Yash Parag Butala (EMAIL), School of Computer Science, Carnegie Mellon University; Pawan Goyal (EMAIL), Department of Computer Science, Indian Institute of Technology Kharagpur; Niloy Ganguly (EMAIL), Department of Computer Science, Indian Institute of Technology Kharagpur |
| Pseudocode | No | The paper describes the FastDoc framework and its steps in Section 2 and visualizes the pipeline in Figure 1 and Figure 4. It also presents mathematical formulas for loss functions. However, there are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are there structured, step-by-step procedures formatted like code. |
| Open Source Code | Yes | Code and datasets are available at https://github.com/manavkapadnis/FastDoc-Fast-Pre-training-Technique/ |
| Open Datasets | Yes | Code and datasets are available at https://github.com/manavkapadnis/FastDoc-Fast-Pre-training-Technique/ We pre-train FastDoc on a subset of the E-Manuals Corpus (Nandy et al., 2021) [...] Google Product Taxonomy (GPrT) (5,583 possible hierarchies across 7 levels of hierarchy) is used [...] We pre-train FastDoc on a subset of the ArXiv [...] ArXiv Category Taxonomy (consisting of 155 possible hierarchies across 3 levels of hierarchy) is used [...] We pre-train FastDoc on a subset of the EURLEX57K dataset (Chalkidis et al., 2019) [...] The hierarchical class assignments of the documents in the EUR-Lex Dataset (Loza Mencía et al., 2010) [...] TechQA (Castelli et al., 2020) is a span-based QA dataset [...] S10 QA Dataset (Nandy et al., 2021) consists of 904 question-answer pairs [...] We use multiple datasets from SciBERT Benchmark Datasets (mentioned in Beltagy et al. (2019)) for training and evaluation. The following downstream tasks and corresponding datasets are used for evaluation: (1) NER (Named Entity Recognition): We use the BC5CDR (Li et al., 2016), JNLPBA (Kim et al., 2004), and NCBI-Disease (Doğan et al., 2014) NER Datasets of the Biomedical Domain. (2) REL (Relation Classification): This task predicts the type of relation between entities. The ChemProt Dataset (Kringelum et al., 2016) from the Biomedical Domain and SciERC Dataset (Luan et al., 2018) from the Computer Science Domain are used for evaluation. (3) CLS (Text Classification): SciCite Dataset (Cohan et al., 2019) gathered from Multiple Domains is used. CUAD (Contract Understanding Atticus Dataset) (Hendrycks et al., 2021) is used [...] SQuAD 2.0 Dataset (Rajpurkar et al., 2018) GLUE (Wang et al., 2018) benchmark |
| Dataset Splits | Yes | The dataset has 600 training, 310 dev, and 490 evaluation QA pairs. The dataset is divided in the ratio of 7:2:1 into training, validation, and test sets, respectively. CUAD (Contract Understanding Atticus Dataset) (Hendrycks et al., 2021) is used, which is annotated by legal experts for the task of Legal Contract Review. It consists of 13,101 clauses across 41 types of clauses annotated from 510 contracts. Given a contract, for each type of clause, the task requires extracting relevant clauses as spans of text related to the clause type. Details of the dataset splits are given in Section D.3 of Appendix. The dataset is split 80/20 into train/test, with a small validation set for the preliminary experiments to perform hyperparameter grid search. SQuAD 2.0 is a span-based open-domain reading comprehension dataset, consisting of 130,319 training, 11,873 dev, and 8,862 test QA pairs. |
| Hardware Specification | Yes | NVIDIA GeForce GTX 1080 Ti GPUs are used for pre-training. |
| Software Dependencies | No | The paper mentions software like BERT, RoBERTa, PyTorch, and TensorFlow implicitly (e.g., in discussion of baselines or training time in Section F's footnote: 'PyTorch, TensorFlow'). However, specific version numbers for these software packages or other libraries are not provided. |
| Experiment Setup | Yes | We use a batch size of 32, and AdamW optimizer (Loshchilov & Hutter, 2018) with an initial learning rate of 5×10⁻⁵, which linearly decays to 0. The hyperparameters used are the same as that in Beltagy et al. (2019). Fine-tuning on SQuAD 2.0 (Rajpurkar et al., 2018): The hyperparameters used are the same as mentioned in Rajpurkar et al. (2018). Fine-tuning on TechQA Dataset: The hyperparameters used are the ones mentioned in the default implementation of Castelli et al. (2020). For all the fine-tuning experiments on S10 QA Dataset, we use a batch size of 16 (except for the pre-trained DeCLUTR model with DistilRoBERTa-BASE backbone, where a batch size of 32 is used), and train for 4 epochs with an AdamW optimizer (Loshchilov & Hutter, 2018) and an initial learning rate of 4×10⁻⁵, that decays linearly. For all such experiments, we fine-tune for 10 epochs, with a learning rate of 3×10⁻⁵, input sequence length of 512, and batch size of 32. |
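The Experiment Setup row quotes a learning rate that "linearly decays to 0" from an initial value of 5×10⁻⁵. A minimal sketch of that schedule, assuming a step-based decay (the function name and total-step count below are illustrative, not from the paper):

```python
def linear_decay_lr(step, total_steps, initial_lr=5e-5):
    """Learning rate at a given step under linear decay from
    initial_lr (at step 0) to 0 (at total_steps)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining

# LR at the start, midpoint, and end of 10,000 hypothetical steps.
print(linear_decay_lr(0, 10_000))       # 5e-05
print(linear_decay_lr(5_000, 10_000))   # 2.5e-05
print(linear_decay_lr(10_000, 10_000))  # 0.0
```

In practice this corresponds to a linear scheduler attached to the AdamW optimizer mentioned in the paper; the standalone function only illustrates the decay curve.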
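The Dataset Splits row mentions a 7:2:1 division into training, validation, and test sets. A minimal sketch of such a split, assuming truncation for remainder handling (the paper does not specify how non-divisible sizes are rounded):

```python
def split_721(items):
    """Split a sequence 7:2:1 into (train, validation, test).
    Remainder items fall into the test portion; this rounding
    choice is an assumption, not taken from the paper."""
    n = len(items)
    n_train = int(n * 0.7)
    n_val = int(n * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_721(list(range(100)))
print(len(train), len(val), len(test))  # 70 20 10
```

For real experiments the data would typically be shuffled with a fixed seed before slicing, so that the three portions are drawn uniformly.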