Make Your LLM Fully Utilize the Context

Authors: Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, Weizhu Chen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5 → 26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3 → 59.2 accuracy on MMLU).
Researcher Affiliation | Collaboration | National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center of Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; Microsoft; Peking University
Pseudocode | Yes | Algorithm 1: Implementation of Raw Text Segmentation (see the segmentation sketch below the table)
Open Source Code | No | The model and evaluation data will be released. Due to internal data release policies, the training data will not be released soon.
Open Datasets | Yes | We take the realnewslike subset from the C4 corpus (Raffel et al., 2020) as C, and take GPT-4-Turbo (OpenAI, 2023b) as the LLM to generate QA pairs. (See the C4 loading sketch below the table.)
Dataset Splits | No | The paper describes the data sources and total data size but does not provide explicit training, validation, or test dataset splits (e.g., percentages or counts) for its own experiments.
Hardware Specification | Yes | The training process is conducted on 16 nodes of 8x80G A100 GPUs with the full sharding strategy and CPU offload strategy implemented by PyTorch FSDP (Zhao et al., 2023). (See the FSDP sketch below the table.)
Software Dependencies | No | The paper mentions 'pytorch FSDP' and 'lm_eval' but does not specify their version numbers or other software dependencies with specific versions. (See the lm_eval sketch below the table.)
Experiment Setup | Yes | For hyper-parameters, we set the global batch size as 128 and conduct one-epoch training with 14K training steps. We use the cosine learning rate decay with a 1e-6 maximum learning rate and 3% warm-up steps. (See the schedule sketch below the table.)
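The Pseudocode row refers to the paper's Algorithm 1 (raw text segmentation), which is not reproduced in this summary. As a rough illustration of what such a segmentation step typically involves, the sketch below splits raw documents into fixed-length token segments; the 128-token segment length, the tokenizer choice, and the function name are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch of raw-text segmentation (not the paper's Algorithm 1).
# Assumptions: fixed 128-token segments and the Mistral tokenizer; both are
# illustrative choices, not settings quoted from the paper.
from transformers import AutoTokenizer

def segment_raw_text(documents, tokenizer, segment_len=128):
    """Yield token-id segments of at most segment_len tokens per document."""
    for doc in documents:
        token_ids = tokenizer(doc, add_special_tokens=False)["input_ids"]
        for start in range(0, len(token_ids), segment_len):
            yield token_ids[start:start + segment_len]

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    docs = ["Example raw text such as an article from the C4 realnewslike subset."]
    segments = list(segment_raw_text(docs, tok))
    print(f"{len(segments)} segment(s); first segment has {len(segments[0])} tokens")
```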
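The realnewslike subset of C4 cited in the Open Datasets row is publicly available. The snippet below is a minimal sketch of streaming it with the Hugging Face datasets library; the hub path "allenai/c4" and the streaming setup are tooling assumptions, not a loading procedure described by the paper.

```python
# Minimal sketch: stream the C4 realnewslike subset (Raffel et al., 2020).
# Assumption: access via the Hugging Face Hub dataset "allenai/c4"; streaming
# avoids downloading the full corpus up front.
from datasets import load_dataset

realnewslike = load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)

for i, example in enumerate(realnewslike):
    print(example["text"][:200])  # each record carries "text", "timestamp", "url"
    if i == 2:
        break
```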
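The Hardware Specification row states that training used PyTorch FSDP with full sharding and CPU offload. The sketch below shows one way to wrap a model with those two options; the process-group setup, the base model, and the wrapping granularity are simplified assumptions rather than the authors' actual training script.

```python
# Hedged sketch of PyTorch FSDP with full sharding and CPU offload
# (not the authors' training code). Assumes launch via torchrun so that
# rank/world-size environment variables are already set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, CPUOffload
from transformers import AutoModelForCausalLM

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    cpu_offload=CPUOffload(offload_params=True),    # offload parameters to CPU memory
    device_id=torch.cuda.current_device(),
)
```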
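The Software Dependencies row notes that lm_eval is mentioned without a version. As a hedged illustration of how the lm-evaluation-harness is commonly invoked for an MMLU comparison, the snippet below uses its Python API; the entry point and result keys differ across versions (which is exactly why pinning versions matters), so treat this as an assumption rather than the authors' evaluation command.

```python
# Hedged sketch of an lm-evaluation-harness (lm_eval) MMLU run.
# Assumption: a 0.4.x-style API exposing lm_eval.simple_evaluate; other
# versions use different entry points.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["mmlu"],
    batch_size=8,
)
print(results["results"])  # per-task and group accuracies live under "results"
```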
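The hyper-parameters in the Experiment Setup row (global batch size 128, one epoch of 14K steps, cosine decay from a 1e-6 maximum learning rate, 3% warm-up) map onto a standard optimizer/scheduler pairing. The sketch below uses AdamW and transformers' cosine schedule as plausible stand-ins; the paper does not name the exact optimizer or scheduler implementation.

```python
# Hedged sketch of the reported schedule: 14K steps, 3% warm-up, cosine decay,
# 1e-6 maximum learning rate. AdamW and get_cosine_schedule_with_warmup are
# assumed stand-ins, not implementations confirmed by the paper.
import torch
from transformers import get_cosine_schedule_with_warmup

num_training_steps = 14_000
num_warmup_steps = int(0.03 * num_training_steps)  # 3% warm-up = 420 steps

model = torch.nn.Linear(10, 10)  # placeholder for the actual 7B model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

for step in range(num_training_steps):
    # ... forward/backward over a global batch of 128 sequences would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```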