The Representation Landscape of Few-Shot Learning and Fine-Tuning in Large Language Models
Authors: Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. ... We study the models of Llama3 [45], Llama2 [47] families, and Mistral [43]. ... We analyze the Massive Multitask Language Understanding question answering dataset, MMLU [42]... |
| Researcher Affiliation | Academia | Diego Doimo Alessandro Serra Alessio Ansuini Alberto Cazzaniga Area Science Park, Trieste, Italy {diego.doimo,alessandro.serra,alessio.ansuini,alberto.cazzaniga}@areasciencepark.it |
| Pseudocode | No | No pseudocode or algorithm blocks (e.g., a figure or section labeled 'Algorithm X') were found in the paper. |
| Open Source Code | Yes | We provide code to reproduce our experiments at https://github.com/diegodoimo/geometry_icl_finetuning. |
| Open Datasets | Yes | MMLU Dataset. We analyze the Massive Multitask Language Understanding question answering dataset, MMLU [42], taking the implementation of cais/mmlu from Huggingface. (A loading sketch follows the table.) |
| Dataset Splits | Yes | We sample the shots from the MMLU dev set, which has five examples per subject. ... We fine-tune the models with LoRA [48] on a dataset formed by the union of the dev set and some question-answer pairs of the validation set to reach an accuracy comparable to the 5-shot one. ... The dataset on which we fine-tune the models is the union of the MMLU dev set (all five examples per subject are selected) and a subset of the MMLU validation set. (A data-construction sketch follows the table.) |
| Hardware Specification | Yes | We run the experiments on a single Nvidia A100 GPU with a VRAM of 40GB. Extracting the hidden representation of 70 billion parameter models requires 5 A100 GPUs, and their fine-tuning requires 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions specific software packages like 'ADP presented in d'Errico et al. [11] and implemented in the DADApy package [49]' and 'We measure the ID of the hidden representations with the Gride algorithm (see Sec. 3.2)' but does not provide version numbers for these or other software dependencies. (A DADApy usage sketch follows the table.) |
| Experiment Setup | Yes | The LoRA rank is 64, α is 16, and dropout = 0.1. For the 70 billion models, we choose a rank of 128, and α is 32. ... We use a learning rate = 2 × 10⁻⁴; for the 70 billion models we decrease it to 1 × 10⁻⁴. For all the models, we apply a cosine annealing scheduler and a linear warm-up for 5% of the total iterations. We fine-tune all the models with batch size = 16 using the Adam optimizer without weight decay. (A configuration sketch follows the table.) |
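
The Open Datasets row points to the `cais/mmlu` implementation on the Hugging Face Hub. A minimal loading sketch, assuming the `datasets` library and the `all` config with dev/validation/test splits as listed on the public dataset card (the paper itself only names `cais/mmlu`):

```python
# Minimal sketch: the "all" config and split names follow the public cais/mmlu
# dataset card; the paper only states that the Hugging Face implementation is used.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")

dev = mmlu["dev"]                # five examples per subject, used as the few-shot pool
validation = mmlu["validation"]
test = mmlu["test"]

print(dev[0])                    # fields: question, subject, choices, answer
```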
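
The Dataset Splits row describes the fine-tuning set as the union of the full MMLU dev split and a subset of the validation split, chosen so that fine-tuned accuracy is comparable to the 5-shot accuracy. A sketch of that construction; the subset size and shuffling seed below are placeholders, not values from the paper:

```python
# Sketch of the fine-tuning set: full dev split plus a subset of the validation split.
# The subset size (1500) and seed are placeholders, not values taken from the paper.
from datasets import load_dataset, concatenate_datasets

mmlu = load_dataset("cais/mmlu", "all")

dev = mmlu["dev"]                                                # all five examples per subject
val_subset = mmlu["validation"].shuffle(seed=0).select(range(1500))

finetune_set = concatenate_datasets([dev, val_subset]).shuffle(seed=0)
print(len(finetune_set))
```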
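
The Software Dependencies row names the ADP clustering of d'Errico et al. implemented in DADApy and the Gride intrinsic-dimension estimator, without versions. A minimal sketch of how these tools could be applied to one layer of hidden representations, assuming DADApy's `Data` class and the method names from its documentation (`return_id_scaling_gride`, `compute_density_PAk`, `compute_clustering_ADP`); the random matrix is a stand-in for real activations:

```python
# Sketch only: class and method names follow the DADApy documentation, not the paper.
import numpy as np
from dadapy import Data

# Stand-in for one layer of hidden representations (n_samples x hidden_dim).
hidden = np.random.default_rng(0).standard_normal((2000, 512))

data = Data(hidden)

# Intrinsic dimension: TwoNN as a baseline, then the Gride multiscale estimator.
data.compute_id_2NN()
gride_ids = data.return_id_scaling_gride(range_max=64)  # ID estimates across scales
print("Gride output:", gride_ids)

# Density-based view of the representation landscape: PAk density + ADP density peaks.
data.compute_density_PAk()
data.compute_clustering_ADP(Z=1.6)
print("Number of density peaks (clusters):", data.N_clusters)
```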
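
The Experiment Setup row lists the LoRA and optimization hyperparameters. A configuration sketch expressed with the `peft` and `transformers` libraries; the choice of these libraries, the task type, and the output path are assumptions, and only the numeric values come from the row:

```python
# Hyperparameters from the Experiment Setup row expressed as peft/transformers
# configuration objects; the tooling and task type are assumptions, the numbers are not.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                      # rank 128 for the 70B models
    lora_alpha=16,             # alpha 32 for the 70B models
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="mmlu_lora",            # placeholder path
    learning_rate=2e-4,                # 1e-4 for the 70B models
    lr_scheduler_type="cosine",        # cosine annealing
    warmup_ratio=0.05,                 # linear warm-up over 5% of the iterations
    per_device_train_batch_size=16,
    optim="adamw_torch",               # Adam-style optimizer ...
    weight_decay=0.0,                  # ... without weight decay
)
```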