StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses
Authors: Jia-Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, Rui Yan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method outperforms strong baselines in dialogue tasks and achieves a 4× speedup while reducing memory usage by 18× compared to dense attention recomputation. We conduct experiments on PersonaChat [35], Multi-Session Chat (MSC) [36], Topical-Chat [37] and MultiWOZ [38] datasets. |
| Researcher Affiliation | Academia | Jia-Nan Li¹, Quan Tu¹, Cunli Mao², Zhengtao Yu², Ji-Rong Wen¹, Rui Yan¹. ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Kunming University of Science and Technology. {lijianan, quantu, jrwen, ruiyan}@ruc.edu.cn; maocunli@163.com, ztyu@hotmail.com |
| Pseudocode | No | The paper describes methods using natural language and mathematical equations (e.g., attention mask definitions) but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured code-like steps. |
| Open Source Code | Yes | Code: https://github.com/JinaLeejnl/StreamingDialogue |
| Open Datasets | Yes | We conduct experiments on PersonaChat [35], Multi-Session Chat (MSC) [36], Topical-Chat [37] and MultiWOZ [38] datasets. |
| Dataset Splits | No | Appendix A, Table 5 provides details on 'Train' and 'Test' utterance counts and average lengths for the datasets used. While it clearly defines training and testing splits, it does not explicitly mention or provide details for a 'validation' dataset split. |
| Hardware Specification | Yes | Figure 5 depicts the average per-token latency and memory usage during dialogue generation with an NVIDIA A100 GPU using various methods. The SMR & LMR phase requires about 2 hours on two A100-40G GPUs. Dialogue generation takes only about 15 minutes on a single A100-40G GPU. |
| Software Dependencies | No | The paper mentions using Llama-2-7B, Llama-2-7B-Chat, Llama-3-8B-Instruct, and Mistral-7B models, but it does not specify exact version numbers for any software dependencies like Python, PyTorch, TensorFlow, or specific libraries used for implementation. |
| Experiment Setup | Yes | We investigate the impact of two hyper-parameters in our method: the number of utterances in SMR samples (s) and the number of query-response pairs in LMR samples (l), both ranging over {8, 12, 16, 20, 24, 28, 32}. We only train the attention layer for 1 epoch, with the learning rate set to 5e-5, using cosine annealing and a warm-up step of 0. All models fine-tune only the attention layer for 2 epochs with the same learning rate, scheduler, and warm-up setting. (A hedged configuration sketch based on these hyper-parameters follows this table.) |
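
To make the reported setup concrete, below is a minimal sketch of attention-only fine-tuning with the hyper-parameters quoted in the Experiment Setup row (lr 5e-5, cosine annealing, warm-up step 0), assuming a Hugging Face Transformers workflow. The checkpoint name, epoch count, and step counts are illustrative assumptions, not values confirmed by the paper beyond the quoted text.

```python
# Hedged sketch: fine-tune only the attention layers with lr 5e-5,
# cosine annealing, and zero warm-up steps, as reported in the paper.
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

# Assumed checkpoint; the paper uses Llama-2-7B(-Chat), Llama-3-8B-Instruct, and Mistral-7B.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Freeze everything except the attention projections
# (named "self_attn" in Llama-style Hugging Face models).
for name, param in model.named_parameters():
    param.requires_grad = "self_attn" in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable, lr=5e-5)  # learning rate from the reported setup

num_epochs = 2          # paper reports 1 epoch (SMR & LMR) and 2 epochs (all models)
steps_per_epoch = 1000  # placeholder; depends on dataset size and batch size
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                           # warm-up step set to 0
    num_training_steps=num_epochs * steps_per_epoch,
)

# Inside the training loop, step both per batch:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

The two epoch counts in the quoted setup appear to refer to different training stages (1 epoch for the SMR & LMR phase, 2 epochs for the subsequent fine-tuning of all models); the sketch above shows only the shared optimizer and scheduler configuration.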