Make Your LLM Fully Utilize the Context
Authors: Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, Weizhu Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5→26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3→59.2 accuracy on MMLU). |
| Researcher Affiliation | Collaboration | National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center of Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; Microsoft; Peking University |
| Pseudocode | Yes | Algorithm 1 Implementation of Raw Text Segmentation |
| Open Source Code | No | The model and evaluation data will be released. Due to internal data release policies, the training data will not be released soon. |
| Open Datasets | Yes | We take the realnewslike subset from the C4 corpus (Raffel et al., 2020) as C, and take GPT-4-Turbo (OpenAI, 2023b) as the LLM to generate QA pairs. |
| Dataset Splits | No | The paper describes the data sources and total data size but does not provide explicit training, validation, or test dataset splits (e.g., percentages or counts) for its own experiments. |
| Hardware Specification | Yes | The training process is conducted on 16 nodes of 8x80G A100 GPUs with the full sharding strategy and cpu offload strategy implemented by pytorch FSDP (Zhao et al., 2023). |
| Software Dependencies | No | The paper mentions 'pytorch FSDP' and 'lm_eval' but does not specify their version numbers or other software dependencies with specific versions. |
| Experiment Setup | Yes | For hyper-parameters, we set the global batch size as 128 and conduct one-epoch training with 14K training steps. We use the cosine learning rate decay with a 1e-6 maximum learning rate and 3% warm-up steps. |
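The Open Datasets row quotes the paper's use of the realnewslike subset of C4. As a minimal sketch, assuming the corpus is pulled from the `allenai/c4` mirror on the Hugging Face Hub (the paper names only the subset, not a loading mechanism), the data can be streamed like this:

```python
# Hedged sketch: stream the C4 "realnewslike" subset referenced in the paper.
# The hub path "allenai/c4" and the streaming API are assumptions about how a
# reproducer might fetch the corpus; the paper does not specify this.
from datasets import load_dataset

c4_realnewslike = load_dataset(
    "allenai/c4", "realnewslike", split="train", streaming=True
)

# Peek at a few documents; each record carries "text", "timestamp", and "url".
for example in c4_realnewslike.take(3):
    print(example["text"][:200])
```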
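The Hardware Specification row reports training on 16 nodes of 8x80G A100 GPUs (128 GPUs in total) with PyTorch FSDP full sharding and CPU offload. A minimal sketch of that sharding configuration, with a stand-in module in place of the Mistral-7B backbone (the authors' training script is not released), might look like:

```python
# Hedged sketch of the FSDP configuration named in the paper: full sharding
# plus parameter CPU offload. The tiny nn.Linear is a stand-in for Mistral-7B.
# Launch under a distributed runner, e.g.:
#   torchrun --nproc_per_node=8 fsdp_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Linear(4096, 4096).cuda()  # placeholder for the 7B model

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "full sharding strategy"
    cpu_offload=CPUOffload(offload_params=True),    # "cpu offload strategy"
)
```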
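The Experiment Setup row gives the schedule: global batch size 128, one epoch of 14K training steps, cosine decay from a 1e-6 maximum learning rate, and 3% warm-up. Below is a sketch of that schedule as a PyTorch `LambdaLR`; the optimizer choice (AdamW) and the decay-to-zero floor are assumptions, since the paper states only the decay shape, peak rate, and warm-up fraction:

```python
import math
import torch
import torch.nn as nn

MAX_LR = 1e-6
TOTAL_STEPS = 14_000
WARMUP_STEPS = int(0.03 * TOTAL_STEPS)  # 3% warm-up = 420 steps

model = nn.Linear(8, 8)  # placeholder module; AdamW itself is an assumption
optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR)

def lr_lambda(step: int) -> float:
    """Linear warm-up to MAX_LR, then cosine decay (floor at 0 is assumed)."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per optimization step: call optimizer.step(), then scheduler.step().
```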