A Character-Level Length-Control Algorithm for Non-Autoregressive Sentence Summarization

Authors: Puyuan Liu, Xiang Zhang, Lili Mou

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our NACC model on the Gigaword headline generation [10] and DUC2004 [27] datasets in two settings: supervised and unsupervised. Experiments show that NACC establishes the state-of-the-art performance of non-autoregressive summarization under various target lengths in both settings; NACC even outperforms autoregressive Transformers [37] in the unsupervised setting, where the input and output have stronger correspondence.
Researcher Affiliation | Academia | Puyuan Liu, Xiang Zhang, Lili Mou; Dept. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada; Canada CIFAR AI Chair, Amii; {puyuan, xzhang23}@ualberta.ca, doublepower.mou@gmail.com
Pseudocode | No | The paper describes the proposed dynamic programming algorithm in detail using text and equations (e.g., Section 3.2, Figure 1b) but does not present it in a formally labeled 'Pseudocode' or 'Algorithm' block (an illustrative sketch of the dynamic program is given after this table).
Open Source Code | Yes | Our code, model, and output are released at: https://github.com/MANGA-UOFA/NACC
Open Datasets | Yes | Our model is evaluated on the Gigaword headline generation [30] and DUC2004 datasets [27].
Dataset Splits | Yes | In total, the dataset contains 3.8M, 198K, and 1951 samples for training, validation, and test, respectively.
Hardware Specification | Yes | All experiments were run on an i9-9940X CPU and an RTX6000 GPU.
Software Dependencies | No | The paper mentions using a 'Transformer encoder' and cites its original paper ('Attention is all you need' [37]), but it does not specify any ancillary software with version numbers (e.g., Python, PyTorch, or CUDA versions) that would be needed for replication.
Experiment Setup | Yes | We use a Transformer encoder as the base model, which has 6 layers and 8 attention heads for each layer, following the settings in [37]. The dimensions are 512 and 2048 for the attention and feed-forward modules, respectively. Each training batch contains samples amounting to 4K tokens. The learning rate is chosen from {1e-4, 5e-4} by validation, and we ran 100K gradient updates for the unsupervised setting, but 400K updates for the supervised setting. For our length-control algorithm, we adopt a bucket size of 4, and only consider the most probable 20 words for every generation slot (cf. ws in Eqn. 6) due to efficiency concerns. (A hedged configuration sketch is given below.)
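
Since the Pseudocode row notes that the length-control procedure is described only in prose and equations, a minimal Python sketch of the idea is given here. It is a reconstruction under stated assumptions, not the authors' algorithm: slot candidates are passed as (word, log-probability) pairs, an empty string stands in for a CTC blank, and the bucket size and top-k values mirror the quoted setup; CTC-specific details such as merging repeated tokens, which the released code (https://github.com/MANGA-UOFA/NACC) handles, are omitted.

    def length_control_dp(slot_candidates, target_chars, bucket_size=4, top_k=20):
        """Best-scoring word sequence whose character count stays within budget."""
        # dp maps a character-length bucket to the best (score, words, exact_chars).
        dp = {0: (0.0, [], 0)}
        for candidates in slot_candidates:
            new_dp = {}
            for score, words, chars in dp.values():
                for word, logp in candidates[:top_k]:
                    if word:  # ordinary vocabulary word
                        new_chars = chars + len(word) + (1 if words else 0)  # +1 for the space
                        new_words = words + [word]
                    else:     # empty slot, standing in for a CTC blank
                        new_chars, new_words = chars, words
                    if new_chars > target_chars:
                        continue  # would exceed the character budget
                    bucket = new_chars // bucket_size
                    cand = (score + logp, new_words, new_chars)
                    if bucket not in new_dp or cand[0] > new_dp[bucket][0]:
                        new_dp[bucket] = cand
            dp = new_dp
        best_score, best_words, _ = max(dp.values(), key=lambda x: x[0])
        return " ".join(best_words), best_score

    # Toy usage with two generation slots and a 12-character budget.
    slots = [[("police", -0.1), ("", -2.0)],
             [("arrested", -0.4), ("held", -0.9), ("", -2.0)]]
    print(length_control_dp(slots, target_chars=12))  # e.g. ('police held', ...)

Keeping only the best entry per length bucket bounds the table size by the character budget divided by the bucket size, which is the efficiency trade-off the quoted bucket size of 4 controls.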
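For readability, the quoted experiment setup can also be collected into a configuration sketch. The torch.nn module names and the dictionary keys below are illustrative assumptions; the released NACC code may organize these hyperparameters differently.

    import torch.nn as nn

    # Encoder described in the quoted setup: 6 layers with 8 attention heads each,
    # attention dimension 512, feed-forward dimension 2048 (following [37]).
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=512,
        nhead=8,
        dim_feedforward=2048,
        batch_first=True,
    )
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

    # Remaining hyperparameters quoted above (key names are illustrative).
    TRAIN_CONFIG = {
        "tokens_per_batch": 4_000,            # "samples amounting to 4K tokens"
        "learning_rate_grid": [1e-4, 5e-4],   # chosen by validation
        "gradient_updates_unsupervised": 100_000,
        "gradient_updates_supervised": 400_000,
        "length_control_bucket_size": 4,
        "top_k_words_per_slot": 20,           # cf. ws in Eqn. 6
    }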