End-to-End Ontology Learning with Large Language Models

Authors: Andy Lo, Albert Q. Jiang, Wenda Li, Mateja Jamnik

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both our quantitative and qualitative results on Wikipedia show that OLLM outperforms subtask composition methods, producing more semantically accurate ontologies while maintaining structural integrity. We further demonstrate that our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples.
Researcher Affiliation | Academia | Andy Lo (University of Cambridge, cyal4@cam.ac.uk); Albert Q. Jiang (University of Cambridge, qj213@cam.ac.uk); Wenda Li (University of Edinburgh, wenda.li@ed.ac.uk); Mateja Jamnik (University of Cambridge, mateja.jamnik@cl.cam.ac.uk)
Pseudocode | No | The paper describes methods and processes but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code and datasets are available at https://github.com/andylolu2/ollm.
Open Datasets | Yes | We collect the datasets for the two ontologies considered in this paper: Wikipedia categories and the arXiv taxonomy. We use Wikipedia for learning and in-domain evaluation, and arXiv for out-of-domain evaluation. To build the Wikipedia dataset, we perform a BFS traversal from its root category, Main topic classifications, up to depth 3. For every category encountered, we retrieve the titles and summaries (the text before the first section) of up to 5000 pages that belong in that category. The source data is obtained from the Wikipedia API. The arXiv taxonomy is available from its home page, and the source corpus is constructed from the titles and abstracts of all papers uploaded to arXiv in the years 2020-2022 with at least 10 citations. In total, the Wikipedia dataset has 13886 concepts, 28375 taxonomic relations and 362067 documents, while the arXiv dataset has 161 concepts, 166 taxonomic relations and 126001 documents. (A code sketch of this collection process is given after the table.)
Dataset Splits | Yes | Instead, we first split the full ontology into train and test graphs, and then generate the training document-subgraph pairs. This ensures that there are sufficiently many unseen concepts (and thus relations) in the test split, as shown in Figure 3. Our method is as follows: 1. Let V^top be the set of top-level nodes, that is, children of the root node. Randomly partition V^top into train (V^top_train), validation (V^top_val), and test (V^top_test) splits in a 7:3:10 ratio. (A sketch of this partition step is given after the table.)
Hardware Specification | Yes | Computationally, OLLM required 12 A100-hours for training and 7 A100-hours for inference to generate an ontology for Wikipedia. For the Wikipedia experiment, we use Mistral 7B v0.2 (not instruction-tuned) [21] as the base model. Training takes 12 A100-hours. For inference, we use the vLLM [28] server, which achieves a throughput of 10 documents per second. Inference on the validation and test splits of both datasets takes 12 A100-hours in total.
Software Dependencies | No | The paper mentions software like the CoreNLP pipeline [31], REBEL-large [7] (a model), and Mistral 7B v0.2 [21], but does not provide specific version numbers for software dependencies beyond the model versions themselves.
Experiment Setup | Yes | We finetune Mistral 7B v0.2 [21] with Low-Rank Adaptation [20] on the masked loss objective. The model is trained on the Wikipedia dataset for two epochs with Adam [25]. During inference, the outputs are generated with temperature 0.1 and nucleus sampling [19] with top-p of 0.9. We include a finetuning baseline without the masked loss objective, denoted as Finetune. To adapt OLLM for arXiv, we further finetune the model on 2048 document-subgraph pairs from arXiv. We initialise new low-rank adaptors and train until the loss stops improving on the validation set. We name these models OLLM (transfer) and Finetune (transfer) for training with and without the masked loss objective, respectively. Full details for the Wikipedia and arXiv experiments can be found in Appendix A.1.2. The hyperparameters for the post-processing steps are tuned by grid search on the validation set. We sweep over α ∈ 1 - geomspace(1/|E_raw|, 1, 21) and β ∈ geomspace(0.1, 1, 21) - 0.1, and use the values that maximise Continuous F1. For Wikipedia, we choose the subgraph modelling path length N = 4 as it is the smallest N such that almost all edges (> 99%) occur in at least one relevant subgraph. This criterion is used because smaller N results in smaller subgraphs, which we expect to be easier to model accurately. We choose N = 3 for arXiv for the same reason. (A sketch of the hyperparameter sweep is given after the table.)
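The BFS collection described in the Open Datasets row maps naturally onto the public MediaWiki API. The sketch below is an illustrative reconstruction under that assumption, not the authors' released pipeline (see the GitHub repository for that): category_members, page_summary and bfs_wikipedia are hypothetical helper names, while the root category, depth limit and 5000-page cap mirror the numbers quoted above.

```python
# Illustrative sketch of the quoted BFS collection: traverse Wikipedia
# categories from "Main topic classifications" up to depth 3, recording
# taxonomic edges and the lead-section summary of pages in each category.
# NOT the authors' script; see https://github.com/andylolu2/ollm.
from collections import deque
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category: str, member_type: str) -> list[str]:
    """List members ("subcat" or "page") of a category via the MediaWiki API."""
    titles, params = [], {
        "action": "query", "list": "categorymembers", "format": "json",
        "cmtitle": f"Category:{category}", "cmtype": member_type, "cmlimit": "500",
    }
    while True:
        data = requests.get(API, params=params).json()
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        params.update(data["continue"])  # standard MediaWiki continuation

def page_summary(title: str) -> str:
    """Fetch the text before the first section heading of a page."""
    params = {"action": "query", "prop": "extracts", "exintro": 1,
              "explaintext": 1, "format": "json", "titles": title}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def bfs_wikipedia(root: str = "Main topic classifications",
                  max_depth: int = 3, pages_per_category: int = 5000):
    """Collect (parent, child) category edges and per-category page summaries."""
    edges, documents = [], {}
    queue, seen = deque([(root, 0)]), {root}
    while queue:
        category, depth = queue.popleft()
        pages = category_members(category, "page")[:pages_per_category]
        documents[category] = [(p, page_summary(p)) for p in pages]
        if depth == max_depth:
            continue  # categories at the depth limit are not expanded further
        for sub in category_members(category, "subcat"):
            child = sub.removeprefix("Category:")
            edges.append((category, child))
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return edges, documents
```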
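For step 1 of the Dataset Splits row, the snippet below is a minimal sketch of the 7:3:10 partition of the root's children, assuming the ontology is held as a networkx.DiGraph; split_top_level is a hypothetical helper, not a function from the released code.

```python
# Minimal sketch of step 1 of the quoted split procedure: randomly partition
# the top-level nodes (children of the root) into train/val/test in 7:3:10.
import random
import networkx as nx

def split_top_level(ontology: nx.DiGraph, root, seed: int = 0):
    top = list(ontology.successors(root))   # V^top: children of the root node
    random.Random(seed).shuffle(top)
    n_train = round(len(top) * 7 / 20)
    n_val = round(len(top) * 3 / 20)
    train = top[:n_train]
    val = top[n_train:n_train + n_val]
    test = top[n_train + n_val:]            # remaining ~10/20 of the nodes
    return train, val, test
```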
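The grid search in the Experiment Setup row amounts to the sweep below. prune_edges and continuous_f1 are placeholders for the paper's post-processing routine and Continuous F1 metric (not names from the released code); only the α and β grids follow the quoted text.

```python
# Sketch of the quoted post-processing hyperparameter sweep:
# alpha over 1 - geomspace(1/|E_raw|, 1, 21), beta over geomspace(0.1, 1, 21) - 0.1,
# keeping the pair that maximises Continuous F1 on the validation set.
import numpy as np

def tune_postprocessing(raw_edges, val_ontology, prune_edges, continuous_f1):
    alphas = 1 - np.geomspace(1 / len(raw_edges), 1, 21)   # alpha grid from the text
    betas = np.geomspace(0.1, 1, 21) - 0.1                 # beta grid from the text
    best_alpha, best_beta, best_score = None, None, float("-inf")
    for alpha in alphas:
        for beta in betas:
            pred = prune_edges(raw_edges, alpha=alpha, beta=beta)
            score = continuous_f1(pred, val_ontology)
            if score > best_score:
                best_alpha, best_beta, best_score = alpha, beta, score
    return best_alpha, best_beta
```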