Mixture of In-Context Experts Enhance LLMs' Long Context Awareness

Authors: Hongzhan Lin, Ang Lv, Yuhan Chen, Chen Zhu, Yang Song, Hengshu Zhu, Rui Yan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we present Mixture of In-Context Experts (MoICE), a novel plug-in of LLMs for enhancing context awareness. ... When applying MoICE to open-source LLMs, we freeze the LLMs' parameters and conduct lightweight training only on the MoICE routers. With only a few quick updates, MoICE surpasses many competitive baselines in tasks involving long-context generation and understanding. To evaluate the efficacy of MoICE, we implement it with open-source LLMs, which we will introduce later, and conduct lightweight training of the MoICE routers on a small and general dataset. Subsequently, we evaluate the enhanced LLM's capability to zero-shot undertake multiple tasks in long-context understanding and generation, as detailed in Section 4.2 and Section 4.3. (See the parameter-freezing sketch after this table.)
Researcher Affiliation | Collaboration | Hongzhan Lin (1), Ang Lv (1), Yuhan Chen (2), Chen Zhu (3), Yang Song (4), Hengshu Zhu (3), Rui Yan (1); (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) XiaoMi AI Lab; (3) Career Science Lab, BOSS Zhipin; (4) NLP Center, BOSS Zhipin
Pseudocode | No | The paper does not include a dedicated section or figure for pseudocode or an algorithm block.
Open Source Code | Yes | Code is available at https://github.com/p1nksnow/MoICE.
Open Datasets | Yes | We use a training dataset which extracts the one thousand longest entries from OpenHermes [41] (footnote 3: https://huggingface.co/datasets/HuggingFaceH4/OpenHermes-2.5-1k-longest), citing Teknium, OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants, 2023. (See the dataset-loading sketch after this table.)
Dataset Splits | No | The paper describes the training data for the MoICE router ("one thousand longest entries from OpenHermes [41]") and evaluates on various benchmarks (L-Eval, LongBench, MDQA) in a zero-shot manner, which typically implies using test sets. However, it does not explicitly specify a validation split for either the router training or the benchmark evaluations.
Hardware Specification | Yes | All methods are tested on a single A800-80G GPU, except for applying AB to Mistral-7B-8k, which needs 2 GPUs due to substantial memory requirements. We train the MoICE routers for 1 epoch (about 8 minutes) on four A800-80G GPUs.
Software Dependencies | No | The paper mentions using "FlashAttention-2 [13]" but does not provide specific version numbers for this or any other software dependencies, such as Python or other libraries.
Experiment Setup | Yes | We implement a warm-up strategy comprising 20% of the total steps, with a maximum learning rate of 0.0001. The batch size is 128. α is set as 0.3. We train the MoICE routers for 1 epoch (about 8 minutes) on four A800-80G GPUs. Following Attention Buckets [8], we employed the RoPE angle set of N = 7 items, each assigned base values as follows: {1.0×10^4, 1.75×10^4, 1.8×10^4, 1.9×10^4, 2.0×10^4, 2.25×10^4, 2.5×10^4} for Llama2-7B and Mistral-7B, and {1.0×10^6, 1.25×10^6, 1.4×10^6, 1.8×10^6, 1.9×10^6, 2.25×10^6, 2.5×10^6} for Qwen1.5-7B. (See the hyperparameter sketch after this table.)
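
The Research Type row reports that MoICE is trained by freezing the backbone LLM and updating only the routers. Below is a minimal sketch of that regime under stated assumptions: the placeholder router module, its name `moice_router`, and its per-layer placement are hypothetical stand-ins for illustration, not the authors' released implementation.

```python
# Sketch: freeze every base-LLM parameter and train only the MoICE routers.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Placeholder stand-in for the real MoICE routers: one small router per layer
# scoring the N = 7 RoPE "experts" for each attention head. The shape and
# placement here are illustrative assumptions only.
num_experts = 7
num_heads = model.config.num_attention_heads
for layer in model.model.layers:
    layer.self_attn.moice_router = nn.Linear(
        model.config.hidden_size, num_heads * num_experts, bias=False
    )

for name, param in model.named_parameters():
    # Only router parameters receive gradients; the backbone stays frozen.
    param.requires_grad = "moice_router" in name

router_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(router_params, lr=1e-4)  # peak LR reported in the paper
```

Keeping the backbone frozen means only the tiny router matrices are optimized, which is consistent with the paper's claim that the routers converge after only a few quick updates.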
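
For the Open Datasets row, the footnoted Hugging Face repository can be pulled with the `datasets` library. This is a minimal loading sketch; the `train` split name is an assumption about the dataset layout.

```python
# Sketch: load the one-thousand-longest OpenHermes-2.5 subset cited in the footnote.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/OpenHermes-2.5-1k-longest", split="train")
print(len(ds), ds.column_names)
```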
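
For the Experiment Setup row, the sketch below assembles the reported hyperparameters. Only the warm-up fraction, peak learning rate, batch size, and RoPE base values come from the paper; the linear-decay schedule after warm-up, the stand-in router parameters, the step count, and the `head_dim` value are illustrative assumptions.

```python
# Sketch: 20% warm-up to a peak LR of 1e-4, batch size 128, and the N = 7
# RoPE base values for Llama2-7B / Mistral-7B.
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in parameters; in practice these are the MoICE router weights.
router_params = [torch.nn.Parameter(torch.zeros(7))]
optimizer = torch.optim.AdamW(router_params, lr=1e-4)

total_steps = 8  # ~1,000 samples / batch size 128, trained for 1 epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.2 * total_steps),  # 20% of total steps
    num_training_steps=total_steps,
)

# RoPE bases for Llama2-7B and Mistral-7B; Qwen1.5-7B uses the 1e6-scale set.
ROPE_BASES = [1.0e4, 1.75e4, 1.8e4, 1.9e4, 2.0e4, 2.25e4, 2.5e4]
head_dim = 128  # per-head dimension assumed for the 7B models
# Each base yields a distinct set of rotary frequencies, i.e. one in-context expert.
inv_freq = {
    base: 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    for base in ROPE_BASES
}
```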