Mixture of In-Context Experts Enhance LLMs' Long Context Awareness
Authors: Hongzhan Lin, Ang Lv, Yuhan Chen, Chen Zhu, Yang Song, Hengshu Zhu, Rui Yan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we present Mixture of In-Context Experts (MoICE), a novel plug-in for LLMs that enhances context awareness. ... When applying MoICE to open-source LLMs, we freeze the LLMs' parameters and conduct lightweight training only on the MoICE routers. With only a few quick updates, MoICE surpasses many competitive baselines on tasks involving long-context generation and understanding. To evaluate the efficacy of MoICE, we implement it with open-source LLMs, which we will introduce later, and conduct lightweight training of the MoICE routers on a small, general dataset. Subsequently, we evaluate the enhanced LLMs' capability to zero-shot undertake multiple tasks in long-context understanding and generation, as detailed in Section 4.2 and Section 4.3. (A minimal router-training sketch follows this table.) |
| Researcher Affiliation | Collaboration | Hongzhan Lin¹, Ang Lv¹, Yuhan Chen², Chen Zhu³, Yang Song⁴, Hengshu Zhu³, Rui Yan¹. ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²XiaoMi AI Lab; ³Career Science Lab, BOSS Zhipin; ⁴NLP Center, BOSS Zhipin |
| Pseudocode | No | The paper does not include a dedicated section or figure for pseudocode or an algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/p1nksnow/MoICE. |
| Open Datasets | Yes | We use a training dataset3 which extracts the one thousand longest entries from OpenHermes [41]. (Footnote 3: https://huggingface.co/datasets/HuggingFaceH4/OpenHermes-2.5-1k-longest.) Reference [41]: Teknium. OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants, 2023. |
| Dataset Splits | No | The paper describes the training data for the MoICE router ("one thousand longest entries from OpenHermes [41]") and evaluates on various benchmarks (L-Eval, LongBench, MDQA) in a zero-shot manner, which typically implies using test sets. However, it does not explicitly specify a validation dataset split for either the router training or the benchmark evaluations. |
| Hardware Specification | Yes | All methods are tested on a single A800-80G GPU, except for applying AB to Mistral-7B-8k, which needs 2 GPUs due to substantial memory requirements. We train the MoICE routers for 1 epoch (about 8 minutes) on four A800-80G GPUs. |
| Software Dependencies | No | The paper mentions using "FlashAttention-2 [13]" but does not provide specific version numbers for this or any other software dependencies, such as Python or other libraries. |
| Experiment Setup | Yes | We implement a warm-up strategy comprising 20% of the total steps, with a maximum learning rate of 0.0001. The batch size is 128. α is set as 0.3. We train the MoICE routers for 1 epoch (about 8 minutes) on four A800-80G GPUs. Following Attention Buckets [8], we employed the RoPE angle set of N = 7 items, each assigned base values as follows: {1.0×10⁴, 1.75×10⁴, 1.8×10⁴, 1.9×10⁴, 2.0×10⁴, 2.25×10⁴, 2.5×10⁴} for Llama2-7B and Mistral-7B, and {1.0×10⁶, 1.25×10⁶, 1.4×10⁶, 1.8×10⁶, 1.9×10⁶, 2.25×10⁶, 2.5×10⁶} for Qwen1.5-7B. (A configuration sketch follows this table.) |
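
For concreteness, below is a minimal sketch of the router-training setup quoted in the Research Type row: the backbone LLM is frozen and only lightweight routers, which produce mixture weights over the N = 7 in-context experts (one RoPE base each), receive gradient updates. The names `MoICERouter` and `build_routers`, the per-layer attachment point, and the shapes are illustrative assumptions, not the actual implementation in the MoICE repository.

```python
import torch
import torch.nn as nn


class MoICERouter(nn.Module):
    """Illustrative router: maps a query hidden state to softmax mixture
    weights over N in-context experts (one expert per RoPE base)."""

    def __init__(self, hidden_size: int, num_experts: int = 7):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, query_states: torch.Tensor) -> torch.Tensor:
        # query_states: (batch, seq_len, hidden_size)
        # returns: (batch, seq_len, num_experts) weights that would combine
        # attention outputs computed under each expert's RoPE base
        return torch.softmax(self.proj(query_states), dim=-1)


def build_routers(model: nn.Module, num_layers: int, hidden_size: int,
                  num_experts: int = 7) -> nn.ModuleList:
    """Freeze every backbone parameter and create one trainable router per
    layer; only the routers are handed to the optimizer."""
    for p in model.parameters():
        p.requires_grad_(False)
    return nn.ModuleList(
        MoICERouter(hidden_size, num_experts) for _ in range(num_layers)
    )


# Hypothetical usage for a 7B-scale model:
# routers = build_routers(llm, num_layers=32, hidden_size=4096)
# optimizer = torch.optim.AdamW(routers.parameters(), lr=1e-4)
```
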
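The hyperparameters quoted in the Experiment Setup row can likewise be written down directly. The sketch below lists the two RoPE base sets and a learning-rate schedule with linear warm-up over the first 20% of steps up to 1e-4. The head dimension and the constant post-warm-up shape are assumptions; the quote specifies only the warm-up fraction, the maximum learning rate, the batch size of 128, and α = 0.3.

```python
import torch

# RoPE base values quoted above (N = 7 in-context experts per model family).
ROPE_BASES = {
    "llama2-7b_mistral-7b": [1.0e4, 1.75e4, 1.8e4, 1.9e4, 2.0e4, 2.25e4, 2.5e4],
    "qwen1.5-7b":           [1.0e6, 1.25e6, 1.4e6, 1.8e6, 1.9e6, 2.25e6, 2.5e6],
}


def rope_inv_freq(base: float, head_dim: int = 128) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base (head_dim assumed)."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)


# One inverse-frequency table per expert, per model family.
EXPERT_FREQS = {name: [rope_inv_freq(b) for b in bases]
                for name, bases in ROPE_BASES.items()}


def lr_at_step(step: int, total_steps: int,
               max_lr: float = 1e-4, warmup_frac: float = 0.2) -> float:
    """Linear warm-up over the first 20% of steps, then constant at max_lr
    (the shape after warm-up is an assumption)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr
```
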