Focused Transformer: Contrastive Training for Context Scaling
Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Miłoś
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we perform extensive experiments on smaller models to analyze and further validate our approach. |
| Researcher Affiliation | Collaboration | ¹IDEAS NCBR, ²Institute of Mathematics, Polish Academy of Sciences, ³University of Warsaw, ⁴Google DeepMind, ⁵deepsense.ai, ⁶xAI |
| Pseudocode | Yes | Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LONGLLAMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LONGLLAMA models adeptly manage a 256k context length for passkey retrieval. ... See Figure 2 for an overview of the FOT architecture and Appendix L for pseudocode. (A rough memory-attention sketch is given after the table.) |
| Open Source Code | Yes | We release the checkpoints and source code of LONGLLAMA. ... We release the inference code on GitHub: https://github.com/CStanKonrad/long_llama and the LONGLLAMA-3B checkpoint on Hugging Face: https://huggingface.co/syzymon/long_llama_3b. (A checkpoint-loading sketch is given after the table.) |
| Open Datasets | Yes | The data used for both fine-tuning and pre-training is the C4 dataset Raffel et al. [2019a]... Our dataset mixture based on RedPajama [Together Computer, 2023] and The Stack [Kocetkov et al., 2022]... We evaluate on the following long-context language modeling datasets: PG-19 (English books), arXiv (mathematical papers), GitHub (code), and Isabelle (formal proofs). |
| Dataset Splits | No | In Table 6 we present the performance on the validation set of Qasper [Dasigi et al., 2021] from SCROLLS [Shaham et al., 2022] and compare our results to LongChat 7B [Ma and Zhang, 2023] and two baseline short-context models. We note that our model shows gains from increased context length. ... Although a validation set is mentioned, the paper does not provide explicit details about its size, percentage, or how the split was performed, which is required for reproducibility. |
| Hardware Specification | Yes | We used TPU virtual machines from the Google Cloud Platform (GCP). Each TPU virtual machine has 8 TPUv2 / TPUv3 cores totaling 64GB / 128GB of device memory, 96 CPU cores, and over 300GB of RAM. In larger-scale experiments (Section 5.2) we used machines with 32 TPUv3 cores. For training the LONGLLAMA checkpoints, a TPUv3-128 pod provided by the TPU Research Cloud was used, which we gratefully acknowledge. |
| Software Dependencies | No | The paper mentions several components like 'RMSNorm,' 'SiLU activation,' 'SentencePiece tokenizer,' and 'FAISS' but does not specify version numbers for general software dependencies such as Python, PyTorch, or TensorFlow, or for the specific libraries it uses. For example, 'We use the exact kNN search implemented in FAISS [Johnson et al., 2017]' mentions FAISS but not its version. (A FAISS retrieval sketch is given after the table.) |
| Experiment Setup | Yes | We use L = {6, 12, 18} (resp. L = {8, 16, 24}) as the memory layers for the 3B (resp. 7B) LONGLLAMA model. We fine-tune the models on 10B (resp. 3B) tokens using FOT, 8k context length and our dataset mixture based on RedPajama [Together Computer, 2023], see Appendix A.3. During fine-tuning, we use a batch size of 256K tokens, a constant learning rate of 2e-5... and weight decay of 0.01. ... The hyperparameters for each model size can be found in Appendix E. Table 9 shows hyperparameters used in our experiments. We used context length 512 unless stated otherwise. ... Optimizer Adafactor, Learning rate schedule Inverse Square Root, Warmup steps 1000. (A configuration sketch collecting these values is given after the table.) |
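
The Pseudocode row refers to Figure 2 and Appendix L for the FOT architecture. As a rough illustration only, not the authors' implementation, the sketch below shows how a designated memory layer could extend local attention with the top-k key-value pairs retrieved from an external cache; all function names, tensor shapes, and the retrieval step are assumptions made for this example.

```python
# Illustrative sketch of memory-augmented attention in a "memory layer"
# (names, shapes, and the retrieval step are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def memory_attention(q, k_local, v_local, k_mem, v_mem, top_k=16):
    """Attend over local keys/values extended with top-k entries
    retrieved from an external (key, value) memory cache.

    q:        (n_queries, d)  queries of the current context
    k_local:  (n_local, d)    keys of the current (local) context
    v_local:  (n_local, d)    values of the current (local) context
    k_mem:    (n_mem, d)      cached keys from previous contexts
    v_mem:    (n_mem, d)      cached values from previous contexts
    """
    d = q.shape[-1]

    # Exact inner-product retrieval: each query picks its own top-k memory keys.
    scores_mem = q @ k_mem.T                      # (n_queries, n_mem)
    top_idx = scores_mem.topk(top_k, dim=-1).indices

    outputs = []
    for i in range(q.shape[0]):
        # Extend the local keys/values with the retrieved memory entries.
        k_ext = torch.cat([k_local, k_mem[top_idx[i]]], dim=0)
        v_ext = torch.cat([v_local, v_mem[top_idx[i]]], dim=0)
        attn = F.softmax(q[i] @ k_ext.T / d ** 0.5, dim=-1)
        outputs.append(attn @ v_ext)
    return torch.stack(outputs)                   # (n_queries, d)
```

In the paper, this kind of retrieval is restricted to the designated memory layers L listed in the Experiment Setup row and is performed with exact kNN search (see the FAISS sketch below).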
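
The Open Source Code row points to the LONGLLAMA-3B checkpoint on Hugging Face. A minimal loading sketch is shown below, assuming the checkpoint works through the standard transformers Auto classes with trust_remote_code enabled; the exact arguments may differ from the released inference code.

```python
# Minimal sketch of loading the released checkpoint
# (tokenizer class and generate arguments are assumptions).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("syzymon/long_llama_3b")
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b",
    torch_dtype=torch.float32,
    trust_remote_code=True,   # custom model code ships with the checkpoint
)

prompt = "The Focused Transformer extends the effective context by"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```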
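
The Software Dependencies row notes that exact kNN search is done with FAISS, without a pinned version. The sketch below shows the kind of exact (brute-force) inner-product index FAISS provides for retrieving top-k cached keys; dimensions and data are made up for illustration.

```python
# Sketch of exact kNN retrieval over cached keys with FAISS
# (index type, dimensions, and data are illustrative assumptions).
import numpy as np
import faiss

d = 128                                  # key/query dimension
n_cached = 10_000                        # number of cached (key, value) pairs

keys = np.random.randn(n_cached, d).astype("float32")
queries = np.random.randn(4, d).astype("float32")

index = faiss.IndexFlatIP(d)             # exact (brute-force) inner-product search
index.add(keys)                          # cache keys from previous contexts

top_k = 16
scores, idx = index.search(queries, top_k)   # top-k memory entries per query
print(idx.shape)                             # (4, 16)
```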
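
The Experiment Setup row mixes the LONGLLAMA fine-tuning settings with the small-scale training settings. Collecting only the values quoted above into one place, a configuration sketch might look as follows; the dictionary keys are invented for illustration and anything not quoted (e.g., the full Appendix E tables) is omitted.

```python
# Configuration sketch collecting the hyperparameters quoted above
# (keys are invented; values are only those stated in the row).
finetune_common = {
    "method": "FOT",
    "context_length": 8 * 1024,        # 8k context during fine-tuning
    "batch_size": "256K tokens",
    "learning_rate": 2e-5,             # constant schedule
    "weight_decay": 0.01,
    "data": "RedPajama + The Stack mixture (Appendix A.3)",
}

per_model = {
    "LongLLaMA-3B": {"memory_layers": [6, 12, 18], "finetune_tokens": "10B"},
    "LongLLaMA-7B": {"memory_layers": [8, 16, 24], "finetune_tokens": "3B"},
}

small_scale = {
    "context_length": 512,             # unless stated otherwise
    "optimizer": "Adafactor",
    "lr_schedule": "inverse square root",
    "warmup_steps": 1000,
}
```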