Focused Transformer: Contrastive Training for Context Scaling

Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Miłoś

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we perform extensive experiments on smaller models to analyze and further validate our approach."
Researcher Affiliation | Collaboration | (1) IDEAS NCBR, (2) Institute of Mathematics, Polish Academy of Sciences, (3) University of Warsaw, (4) Google DeepMind, (5) deepsense.ai, (6) xAI
Pseudocode | Yes | "Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LONGLLAMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LONGLLAMA models adeptly manage a 256k context length for passkey retrieval. ... See Figure 2 for an overview of the FOT architecture and Appendix L for pseudocode." (a schematic sketch of the memory-attention idea follows the table)
Open Source Code | Yes | "We release the checkpoints and source code of LONGLLAMA. ... We release the inference code on GitHub: https://github.com/CStanKonrad/long_llama and the LONGLLAMA-3B checkpoint on Hugging Face: https://huggingface.co/syzymon/long_llama_3b." (a checkpoint-loading sketch follows the table)
Open Datasets | Yes | "The data used for both fine-tuning and pre-training is the C4 dataset [Raffel et al., 2019a]. ... Our dataset mixture based on RedPajama [Together Computer, 2023] and The Stack [Kocetkov et al., 2022] ... We evaluate on the following long-context language modeling datasets: PG-19 (English books), arXiv (mathematical papers), GitHub (code), and Isabelle (formal proofs)." (a dataset-loading sketch follows the table)
Dataset Splits | No | "In Table 6 we present the performance on the validation set of Qasper [Dasigi et al., 2021] from SCROLLS [Shaham et al., 2022] and compare our results to LongChat 7B [Ma and Zhang, 2023] and two baseline short-context models. We note that our model shows gains from increased context length." Although a validation set is mentioned, the paper does not give explicit details about its size, its percentage of the data, or how the split was performed, which would be required for reproducibility.
Hardware Specification | Yes | "We used TPU virtual machines from the Google Cloud Platform (GCP). Each TPU virtual machine has 8 TPUv2 / TPUv3 cores totaling 64 GB / 128 GB of device memory, 96 CPU cores, and over 300 GB of RAM. In larger-scale experiments (Section 5.2) we used machines with 32 TPUv3 cores. For training the LONGLLAMA checkpoints, a TPUv3-128 pod provided by the TPU Research Cloud was used, which we gratefully acknowledge."
Software Dependencies | No | The paper mentions several components, such as RMSNorm, the SiLU activation, the SentencePiece tokenizer, and FAISS, but does not specify version numbers for general software dependencies such as Python, PyTorch, or TensorFlow, or for the specific libraries it uses. For example, "We use the exact kNN search implemented in FAISS [Johnson et al., 2017]" mentions FAISS but not its version. (an exact-kNN sketch follows the table)
Experiment Setup | Yes | "We use L = {6, 12, 18} (resp. L = {8, 16, 24}) as the memory layers for the 3B (resp. 7B) LONGLLAMA model. We fine-tune the models on 10B (resp. 3B) tokens using FOT, 8k context length and our dataset mixture based on RedPajama [Together Computer, 2023], see Appendix A.3. During fine-tuning, we use a batch size of 256K tokens, a constant learning rate of 2e-5, and weight decay of 0.01. ... The hyperparameters for each model size can be found in Appendix E. Table 9 shows the hyperparameters used in our experiments. We used context length 512 unless stated otherwise. ... Optimizer: Adafactor, learning-rate schedule: inverse square root, warmup steps: 1000." (the quoted values are gathered into a configuration sketch below)
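The Pseudocode row refers to FOT memory layers that augment local attention with (key, value) pairs retrieved from an external memory. The sketch below is not the authors' pseudocode from Appendix L; it is a minimal NumPy illustration of that general idea, and every name in it (memory_attention, top_k, and so on) is chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(queries, local_keys, local_values,
                     memory_keys, memory_values, top_k=16):
    """Schematic memory layer: each query attends to the local context
    plus its top-k highest-scoring (key, value) pairs from external memory."""
    d = queries.shape[-1]
    # Exact kNN by inner product over the memory keys
    # (the paper delegates this step to FAISS).
    scores_mem = queries @ memory_keys.T                    # (n_q, n_mem)
    top_idx = np.argsort(-scores_mem, axis=-1)[:, :top_k]   # (n_q, top_k)

    outputs = []
    for q, idx in zip(queries, top_idx):
        # Concatenate local keys/values with the retrieved memory entries,
        # then run ordinary scaled dot-product attention.
        k = np.concatenate([local_keys, memory_keys[idx]], axis=0)
        v = np.concatenate([local_values, memory_values[idx]], axis=0)
        attn = softmax(q @ k.T / np.sqrt(d))
        outputs.append(attn @ v)
    return np.stack(outputs)

rng = np.random.default_rng(0)
out = memory_attention(rng.normal(size=(2, 64)),
                       rng.normal(size=(128, 64)), rng.normal(size=(128, 64)),
                       rng.normal(size=(4096, 64)), rng.normal(size=(4096, 64)))
print(out.shape)  # (2, 64)
```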
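For the Open Source Code row, the released LONGLLAMA-3B checkpoint is distributed on Hugging Face. The snippet below is a minimal loading sketch, assuming the transformers and torch packages are installed and that the model repository ships custom modeling code (hence trust_remote_code=True); the linked GitHub README remains the authoritative reference.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint released by the authors on Hugging Face.
MODEL_ID = "syzymon/long_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,
    trust_remote_code=True,  # the repository provides custom FOT modeling code
)

prompt = "My name is Julien and I like to"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```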
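For the Open Datasets row, C4 is the main pre-training and fine-tuning corpus cited. A hedged example of streaming it through the Hugging Face datasets library is shown below; the identifier allenai/c4 and the en configuration are a commonly used mirror, not something the paper specifies.

```python
from datasets import load_dataset

# Stream the English C4 split to avoid downloading the full corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:80])
    if i == 2:
        break
```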
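The Software Dependencies row quotes the paper's use of exact kNN search in FAISS, which corresponds to a flat (brute-force) index. A minimal sketch, assuming cached keys are compared by inner product; the array names are illustrative, and the FAISS version still needs to be pinned by the reader.

```python
import numpy as np
import faiss

d = 128                                              # key dimensionality
cached_keys = np.random.randn(100_000, d).astype("float32")
queries = np.random.randn(4, d).astype("float32")

# IndexFlatIP performs exact (brute-force) inner-product search,
# matching the "exact kNN search" described in the paper.
index = faiss.IndexFlatIP(d)
index.add(cached_keys)

k = 16
scores, neighbor_ids = index.search(queries, k)      # each of shape (4, k)
print(neighbor_ids[0])
```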
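Finally, the hyperparameters scattered through the Experiment Setup row can be collected in one place. The dictionaries below only restate values quoted from the paper; the key names and the split between fine-tuning and small-scale settings are editorial choices, not the authors' configuration schema.

```python
# Hyperparameters quoted in the "Experiment Setup" row, gathered in one place.
LONGLLAMA_FINETUNING = {
    "3b": {"memory_layers": [6, 12, 18], "finetuning_tokens": "10B"},
    "7b": {"memory_layers": [8, 16, 24], "finetuning_tokens": "3B"},
    "context_length": 8 * 1024,
    "batch_size_tokens": 256 * 1024,
    "learning_rate": 2e-5,        # constant schedule
    "weight_decay": 0.01,
}

SMALL_SCALE_EXPERIMENTS = {
    "context_length": 512,        # unless stated otherwise
    "optimizer": "Adafactor",
    "lr_schedule": "inverse_square_root",
    "warmup_steps": 1000,
}
```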