Semantic Self-Segmentation for Abstractive Summarization of Long Documents in Low-Resource Regimes

Authors: Gianluca Moro, Luca Ragazzi (pp. 11085-11093)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental outcomes show the approach significantly improves the performance of abstractive summarization transformers, even with just a dozen labeled examples, achieving new state-of-the-art results on two legal datasets of different domains and contents.
Researcher Affiliation | Collaboration | ¹Department of Computer Science and Engineering, University of Bologna, Cesena Campus, Via dell'Università 50, I-47522 Cesena, Italy; ²CNIT; {gianluca.moro, l.ragazzi}@unibo.it
Pseudocode | Yes | Algorithm 1: Semantic Self-Segmentation. Input: model; doc sentences; summary sentences. Parameters: Ls (lower size); Us (upper size). Output: the chunk-target pairs. (A hedged sketch of this interface follows the table.)
Open Source Code | No | The project website (https://disi-unibo-nlp.github.io/projects/se3) linked in the paper states "Code and data coming soon," indicating that the source code was not available at the time of publication.
Open Datasets | Yes | Australian Legal Case Reports, referenced as AustLII and publicly downloadable from the UCI archive, is a corpus of around 4,000 legal cases from the Federal Court of Australia. [...] BillSum (Kornilova and Eidelman 2019), downloadable from the Hugging Face library and already split into 18,949 (≈85%) documents for training and 3,269 (≈15%) for testing, consists of 22,218 US Congressional bills with human-written references. [...] We used a dataset comprised of sentence triplets from Wikipedia articles (Ein-Dor et al. 2018) for metric learning.
Dataset Splits | Yes | BillSum (Kornilova and Eidelman 2019), downloadable from the Hugging Face library, is already split into 18,949 (≈85%) documents for training and 3,269 (≈15%) for testing. The dataset statistics show that the AustLII documents are much longer than the BillSum ones (Table 1). We collected 1,754 documents, split into 1,578 (90%) for training and 176 (10%) for testing. (A loading sketch follows the table.)
Hardware Specification | Yes | Given the complexity of summarizing long legal documents, we experiment on two legal datasets of different domains and content sizes, using Se3 combined with BART (Lewis et al. 2020) and LED (Beltagy, Peters, and Cohan 2020) on a single Titan Xp GPU with 12GB of memory.
Software Dependencies | No | The paper mentions using the "Hugging Face library" for training the BART and LED models but does not provide specific version numbers for this or any other software dependencies.
Experiment Setup | Yes | We experimented with six chunk size ranges, expressed in number of tokens, by segmenting input documents based on the following sizes: 64-128, 128-256, 256-512, 512-1024, 1024-2048, and 2048-4096. [...] We trained LEGAL-BERT for 1 epoch for metric learning using a batch size of 8 and a learning rate set to 2 × 10⁻⁵. For abstractive summarization, we trained BART and LED for all experiments using the Hugging Face library. All models are fine-tuned for 5 epochs using a batch size of 1 and a learning rate with a linear schedule set to 5 × 10⁻⁵. At inference time, we used 2 as the beam size and length penalty. (Hedged training sketches follow the table.)
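
The pseudocode row above only quotes Algorithm 1's signature. The Python sketch below shows one plausible reading of that interface, under explicit assumptions: chunks are built by greedily packing consecutive document sentences against the token budget, and each summary sentence is routed to the most similar chunk by cosine similarity over sentence-encoder embeddings. The function name build_chunk_target_pairs and the packing heuristic are illustrative, not the authors' exact procedure.

    # Illustrative sketch of Algorithm 1's interface, not the authors' exact
    # procedure: greedy token-budget chunking plus similarity-based pairing.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def build_chunk_target_pairs(model, doc_sentences, summary_sentences,
                                 lower_size=256, upper_size=512):
        """Return (chunk, target) pairs with chunk sizes aimed at [Ls, Us] tokens."""
        # 1) Greedily pack consecutive sentences, closing a chunk before it
        #    would exceed the upper bound. The paper enforces both bounds;
        #    this simplified sketch enforces only the upper one.
        chunks, current, n_tokens = [], [], 0
        for sent in doc_sentences:
            t = len(sent.split())  # crude whitespace token count
            if current and n_tokens + t > upper_size:
                chunks.append(" ".join(current))
                current, n_tokens = [], 0
            current.append(sent)
            n_tokens += t
        if current:
            chunks.append(" ".join(current))

        # 2) Route each summary sentence to its semantically closest chunk
        #    by cosine similarity (embeddings are unit-normalized).
        chunk_emb = model.encode(chunks, normalize_embeddings=True)
        summ_emb = model.encode(summary_sentences, normalize_embeddings=True)
        targets = [[] for _ in chunks]
        for sent, emb in zip(summary_sentences, summ_emb):
            targets[int(np.argmax(chunk_emb @ emb))].append(sent)

        # 3) Keep only chunks that attracted at least one target sentence.
        return [(c, " ".join(t)) for c, t in zip(chunks, targets) if t]

Here model would be the metric-learned sentence encoder (see the training sketches below); any SentenceTransformer instance with an encode method fits the sketch.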
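
On the data side, BillSum's published split can be pulled directly from the Hugging Face hub, and the split sizes match the figures quoted above. AustLII's 90/10 split procedure is not specified in the paper, so the second half of the snippet assumes a seeded random split; austlii_docs is a hypothetical list holding the 1,754 collected documents.

    # BillSum with its published split (18,949 train / 3,269 test).
    from datasets import load_dataset

    billsum = load_dataset("billsum")
    print(len(billsum["train"]), len(billsum["test"]))  # 18949 3269

    # AustLII: the paper reports 1,754 documents split 90/10 but not how the
    # split was drawn; a seeded random split is assumed here.
    from sklearn.model_selection import train_test_split

    train_docs, test_docs = train_test_split(austlii_docs, test_size=0.10,
                                             random_state=0)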
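
Finally, the reported hyperparameters map onto standard library calls. The first sketch covers the metric-learning step (LEGAL-BERT, 1 epoch, batch size 8, lr 2 × 10⁻⁵) with sentence-transformers; the triplet objective is an assumption consistent with the Wikipedia triplet data, and wiki_triplets is a hypothetical list of (anchor, positive, negative) sentences from Ein-Dor et al. (2018).

    # Metric-learning sketch: 1 epoch, batch size 8, lr 2e-5. The triplet
    # objective is assumed; the hyperparameters are the paper's.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    encoder = SentenceTransformer("nlpaueb/legal-bert-base-uncased")  # LEGAL-BERT
    examples = [InputExample(texts=[a, p, n]) for a, p, n in wiki_triplets]
    loader = DataLoader(examples, shuffle=True, batch_size=8)
    encoder.fit(train_objectives=[(loader, losses.TripletLoss(encoder))],
                epochs=1, optimizer_params={"lr": 2e-5})

The second sketch covers summarizer fine-tuning (5 epochs, batch size 1, lr 5 × 10⁻⁵ with a linear schedule) and generation (beam size 2, length penalty 2). The BART checkpoint name and the tokenized train_ds dataset are placeholders, not the authors' configuration.

    # Summarizer fine-tuning sketch with the reported hyperparameters.
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")  # assumed checkpoint
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

    args = Seq2SeqTrainingArguments(
        output_dir="se3-bart",
        num_train_epochs=5,
        per_device_train_batch_size=1,
        learning_rate=5e-5,
        lr_scheduler_type="linear",  # linear schedule, per the paper
    )
    Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds).train()

    # Inference with beam size 2 and length penalty 2, as reported.
    ids = tokenizer("a long legal chunk ...", return_tensors="pt").input_ids
    out = model.generate(ids, num_beams=2, length_penalty=2.0, max_length=256)
    print(tokenizer.decode(out[0], skip_special_tokens=True))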