Switchable Decision: Dynamic Neural Generation Networks
Authors: Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across question answering, summarization, and classification benchmarks show that our method benefits from less computation cost during inference while keeping the same accuracy. |
| Researcher Affiliation | Academia | 1The University of Texas at Austin. Correspondence to: Shujian Zhang <szhang19@utexas.edu>. |
| Pseudocode | Yes | Algorithm 1 Switchable Decision (SD) |
| Open Source Code | No | The paper uses and cites external libraries like Fairseq and Hugging Face Transformers but does not provide a specific link or explicit statement about the availability of the source code for their proposed 'Switchable Decision' method. |
| Open Datasets | Yes | Summarization. We use CNN/Daily Mail (Hermann et al., 2015) and XSum (Narayan et al., 2018) to evaluate our method. Question Answering. The Stanford Question Answering Datasets (SQuAD) v1.1 and v2.0 (Rajpurkar et al., 2016; 2018; Fan et al., 2020) are popular machine reading comprehension benchmarks. Classification. The General Language Understanding Evaluation (GLUE) benchmark is a collection of natural language understanding (NLU) tasks. As shown in Table 1, we include Multi-Genre NLI (MNLI; Williams et al., 2017b; Zhang et al., 2021d), Recognizing Textual Entailment (RTE; Dagan et al., 2005), and Stanford Sentiment Treebank (SST; Socher et al., 2013). |
| Dataset Splits | Yes | Table 1. Dataset Configuration (train / validation / test). Summarization: CNN/Daily Mail 287.2K / 13.4K / 11.5K; XSum 204K / 11.3K / 11.3K. Question Answering: SQuAD 1.1 87.6K / 10.5K / 9.5K; SQuAD 2.0 130.3K / 11.9K / 8.9K. Classification: RTE 2.5K / 276 / 3K; MNLI 393K / 20K / 20K; SST 67K / 872 / 1.8K. |
| Hardware Specification | Yes | Experiments in this part are performed on eight Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Fairseq library' and 'Hugging Face Transformer library' and the 'Adam optimizer' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Following Lewis et al. (2019), we take the pre-trained BART model as the backbone and utilize the provided checkpoint for finetuning on the downstream datasets...Specifically, in summarization, we set the training steps as 50k and the number of warm-up steps as 500. The max number of tokens and the update frequency are set to be 2,048 and 4, respectively. The learning rate is set to 3 × 10⁻⁵. For question answering (SQuAD 1.1/2.0), we set the total number of updates and warm-up updates as 5,430 and 326, respectively. The max number of sentences is 3 per device with an update frequency of 2. The learning rate is 1.5 × 10⁻⁵. We refer the readers to Appendix A for classification hyper-parameter configurations, and more details about the settings. ... Table 15. Experiment setting for MNLI, RTE, and SST-2 (LR: learning rate, BSZ: batch size, NC: number of classes, TS: total number of training steps, WS: warm-up steps). |
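The quoted summarization setup (learning rate 3 × 10⁻⁵, 50K training steps, 500 warm-up steps) implies a warm-up learning-rate schedule. A minimal sketch, assuming linear warm-up followed by linear decay (Fairseq's `polynomial_decay` with power 1 is typical for BART fine-tuning, but the schedule shape is not stated in the excerpt):

```python
def lr_at_step(step, peak_lr=3e-5, warmup_steps=500, total_steps=50_000):
    """Learning rate at a given update step.

    Hyper-parameter values follow the paper's summarization setup;
    the linear warm-up / linear decay shape itself is an assumption,
    not confirmed by the excerpt.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr over the first warmup_steps.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr at warmup_steps down to 0 at total_steps.
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(remaining, 0.0)
```

For example, `lr_at_step(250)` gives 1.5e-05 (halfway through warm-up) and `lr_at_step(50_000)` gives 0.0.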