Semantic Self-Segmentation for Abstractive Summarization of Long Documents in Low-Resource Regimes
Authors: Gianluca Moro, Luca Ragazzi
AAAI 2022, pp. 11085–11093 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental outcomes show the approach significantly improves the performance of abstractive summarization transformers, even with just a dozen labeled examples, achieving new state-of-the-art results on two legal datasets of different domains and contents. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science and Engineering, University of Bologna, Cesena Campus, Via dell'Università 50, I-47522 Cesena, Italy; 2CNIT; {gianluca.moro, l.ragazzi}@unibo.it |
| Pseudocode | Yes | Algorithm 1 (Semantic Self-Segmentation). Input: model; doc sentences; summary sentences. Parameters: Ls (lower size), Us (upper size). Output: the chunk–target pairs. |
| Open Source Code | No | The solution website (https://disi-unibo-nlp.github.io/projects/se3) linked in the paper states "Code and data coming soon." indicating no concrete access to the source code at the time of publication. |
| Open Datasets | Yes | Australian Legal Case Reports, referenced as AustLII and publicly downloadable from the UCI archive, is a corpus of around 4000 legal cases from the Federal Court of Australia. [...] BillSum (Kornilova and Eidelman 2019), downloadable from the Hugging Face library and already split into 18,949 (≈85%) documents for training and 3,269 (≈15%) for testing, consists of 22,218 US Congressional Bills with human-written references. [...] We used a dataset comprised of sentence triplets from Wikipedia articles (Ein-Dor et al. 2018) for metric learning. |
| Dataset Splits | Yes | BillSum (Kornilova and Eidelman 2019), downloadable from the Hugging Face library and already split into 18,949 (≈85%) documents for training and 3,269 (≈15%) for testing, consists of 22,218 US Congressional Bills with human-written references. The statistics of the datasets show that the AustLII documents are much longer than the BillSum ones (Table 1). We collected 1754 documents, split into 1578 (90%) for training and 176 (10%) for testing. |
| Hardware Specification | Yes | Given the complexity of summarizing long legal documents, we experiment on two legal datasets of different domains and content sizes, using Se3 combined with BART (Lewis et al. 2020) and LED (Beltagy, Peters, and Cohan 2020) on a single Titan Xp GPU with 12 GB of memory. |
| Software Dependencies | No | The paper mentions using the "Hugging Face library" for training BART and LED models but does not provide specific version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | We experimented with six chunk size ranges, expressed in the number of tokens, by segmenting input documents based on the following sizes: 64-128, 128-256, 256-512, 512-1024, 1024-2048, and 2048-4096. [...] We trained LEGAL-BERT for 1 epoch for metric learning using a batch size of 8 and a learning rate set to 2 × 10⁻⁵. For abstractive summarization, we trained BART and LED for all experiments using the Hugging Face library. All models are fine-tuned for 5 epochs using a batch size of 1 and a learning rate with a linear schedule set to 5 × 10⁻⁵. At inference time, we used 2 as beam size and length penalty. |
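The segmentation step summarized in the pseudocode row can be sketched as follows. This is only an illustrative reconstruction, not the authors' released code: the `embed` function below is a toy bag-of-words stand-in for the metric-learned LEGAL-BERT sentence encoder the paper actually uses, and the 0.1 similarity cutoff is an invented threshold. The sketch greedily grows a chunk of document sentences; once the chunk exceeds the lower token bound `Ls`, it closes the chunk either when the next sentence drifts semantically or when adding it would exceed the upper bound `Us`.

```python
from math import sqrt

def embed(sentence):
    # Toy bag-of-words vector; a stand-in for a real sentence encoder
    # (the paper fine-tunes LEGAL-BERT with metric learning for this).
    vec = {}
    for tok in sentence.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def segment(sentences, lower, upper, sim_cutoff=0.1):
    """Greedily build chunks whose token counts stay within [lower, upper].

    Once a chunk reaches `lower` tokens, it is closed either when the next
    sentence's similarity to the chunk falls below `sim_cutoff` (a semantic
    boundary) or when adding it would exceed `upper`.
    """
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count >= lower:
            chunk_vec = embed(" ".join(current))
            if count + n > upper or cosine(chunk_vec, embed(sent)) < sim_cutoff:
                chunks.append(" ".join(current))
                current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In the full method each chunk is then paired with its most relevant summary sentences to form the chunk–target training pairs named in the algorithm's output; that matching step is omitted here.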