StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses

Authors: Jia-Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, Rui Yan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method outperforms strong baselines in dialogue tasks and achieves a 4× speedup while reducing memory usage by 18× compared to dense attention recomputation. We conduct experiments on PersonaChat [35], Multi-Session Chat (MSC) [36], Topical-Chat [37] and MultiWOZ [38] datasets.
Researcher Affiliation | Academia | Jia-Nan Li¹, Quan Tu¹, Cunli Mao², Zhengtao Yu², Ji-Rong Wen¹, Rui Yan¹ (¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Kunming University of Science and Technology). Emails: {lijianan, quantu, jrwen, ruiyan}@ruc.edu.cn; maocunli@163.com; ztyu@hotmail.com
Pseudocode | No | The paper describes its method using natural language and mathematical equations (e.g., attention-mask definitions) but does not include any clearly labeled 'Pseudocode' or 'Algorithm' block or figure with structured, code-like steps. (An illustrative attention-mask sketch is given after the table.)
Open Source Code | Yes | Code: https://github.com/JinaLeejnl/StreamingDialogue
Open Datasets | Yes | We conduct experiments on PersonaChat [35], Multi-Session Chat (MSC) [36], Topical-Chat [37] and MultiWOZ [38] datasets.
Dataset Splits | No | Appendix A, Table 5 provides details on 'Train' and 'Test' utterance counts and average lengths for the datasets used. While it clearly defines training and testing splits, it does not explicitly mention or provide details for a 'validation' split.
Hardware Specification | Yes | Figure 5 depicts the average per-token latency and memory usage during dialogue generation with an NVIDIA A100 GPU across various methods. The SMR & LMR phase requires about 2 hours on two A100-40G GPUs. Dialogue generation takes only about 15 minutes on a single A100-40G GPU. (A latency/memory measurement sketch follows the table.)
Software Dependencies | No | The paper mentions using Llama-2-7B, Llama-2-7B-Chat, Llama-3-8B-Instruct, and Mistral-7B models, but it does not specify exact version numbers for any software dependencies such as Python, PyTorch, TensorFlow, or the specific libraries used for implementation.
Experiment Setup | Yes | We investigate the impact of two hyper-parameters in our method: the number of utterances in SMR samples (s) and the number of query-response pairs in LMR samples (l), both ranging over {8, 12, 16, 20, 24, 28, 32}. We only train the attention layer for 1 epoch, with the learning rate set to 5e-5, utilizing cosine annealing to adjust the learning rate, and setting the warm-up step to 0. All models fine-tune only the attention layer for 2 epochs, with the learning rate set to 5e-5, utilizing cosine annealing to adjust the learning rate, and setting the warm-up step to 0. (A fine-tuning configuration sketch follows the table.)
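
Below is a minimal, hypothetical PyTorch sketch of the kind of sparse attention mask the Pseudocode entry alludes to: each token attends causally within its own utterance and to the end-of-utterance (EoU) positions of earlier utterances, which act as attention sinks. The function, its inputs (utterance_ids, eou_mask), and the sink rule are illustrative assumptions, not the paper's exact mask definition.

    # Hypothetical sketch (not the paper's exact formulation): tokens attend causally
    # within their own utterance and to end-of-utterance (EoU) sink positions.
    import torch

    def build_dialogue_mask(utterance_ids: torch.Tensor, eou_mask: torch.Tensor) -> torch.Tensor:
        """utterance_ids: (seq_len,) utterance index per token.
        eou_mask: (seq_len,) bool, True at end-of-utterance tokens.
        Returns a (seq_len, seq_len) bool mask, True where attention is allowed."""
        seq_len = utterance_ids.size(0)
        causal = torch.tril(torch.ones(seq_len, seq_len)).bool()               # standard causal mask
        same_utt = utterance_ids.unsqueeze(0) == utterance_ids.unsqueeze(1)    # query and key in same utterance
        sink_keys = eou_mask.unsqueeze(0).expand(seq_len, -1)                  # keys that are EoU sinks
        return causal & (same_utt | sink_keys)

    # Toy usage: two utterances of three tokens each, EoU tokens at positions 2 and 5.
    utt = torch.tensor([0, 0, 0, 1, 1, 1])
    eou = torch.tensor([False, False, True, False, False, True])
    print(build_dialogue_mask(utt, eou).int())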
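
As a hedged illustration of the per-token latency and peak-memory measurement mentioned in the Hardware Specification entry, the snippet below times greedy generation and reads peak GPU memory via torch.cuda utilities; the model name, prompt, and token budget are placeholders, not the paper's benchmark script.

    # Rough per-token latency and peak GPU memory measurement (placeholder model and prompt).
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

    inputs = tok("A: Hi, how was your weekend?\nB:", return_tensors="pt").to("cuda")

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"per-token latency: {elapsed / new_tokens * 1000:.1f} ms")
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")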
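
The Experiment Setup entry can be made concrete with the following configuration sketch for a Hugging Face causal LM: only attention-layer parameters stay trainable, AdamW runs at a 5e-5 learning rate, and a cosine-annealing schedule with zero warm-up steps is attached. The module-name filter ("self_attn") and the total step count are assumptions, not the authors' training code.

    # Hedged sketch of the reported recipe: attention-only fine-tuning, lr 5e-5,
    # cosine annealing, no warm-up. Module names and step counts are illustrative.
    import torch
    from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

    # Keep only attention-layer parameters trainable (matched here by "self_attn").
    for name, param in model.named_parameters():
        param.requires_grad = "self_attn" in name

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=5e-5)

    num_training_steps = 1000  # placeholder; depends on dataset size and epoch count
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )

    # In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()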