The Representation Landscape of Few-Shot Learning and Fine-Tuning in Large Language Models

Authors: Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga

NeurIPS 2024

Reproducibility assessment. Each entry below lists the reproducibility variable, the assessed result, and the LLM response (the supporting excerpt or explanation).
Research Type: Experimental
LLM Response: "We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. ... We study the models of Llama3 [45], Llama2 [47] families, and Mistral [43]. ... We analyze the Massive Multitask Language Understanding question answering dataset, MMLU [42]..."

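For context on what analyzing hidden representations involves in practice, the sketch below (not the authors' released code) extracts the per-layer activation of the last prompt token from a causal LLM with Hugging Face transformers; the model name and the choice of pooling the last token are illustrative assumptions.

    # Minimal sketch: per-layer hidden representations of the final prompt token,
    # the kind of intermediate activation the paper's landscape analysis is built on.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any Llama/Mistral checkpoint works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states is a tuple: (embeddings, layer_1, ..., layer_L).
    # Keep the last-token representation at every layer.
    last_token_reps = torch.stack([h[0, -1, :] for h in out.hidden_states])  # (L+1, d_model)
    print(last_token_reps.shape)
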
Researcher Affiliation: Academia
LLM Response: "Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga; Area Science Park, Trieste, Italy; {diego.doimo, alessandro.serra, alessio.ansuini, alberto.cazzaniga}@areasciencepark.it"

Pseudocode: No
LLM Response: No pseudocode or algorithm blocks (e.g., a figure or section labeled 'Algorithm X') were found in the paper.

Open Source Code: Yes
LLM Response: "We provide code to reproduce our experiments at https://github.com/diegodoimo/geometry_icl_finetuning."

Open Datasets: Yes
LLM Response: "MMLU Dataset. We analyze the Massive Multitask Language Understanding question answering dataset, MMLU [42], taking the implementation of cais_mmlu from Huggingface."

Dataset Splits: Yes
LLM Response: "We sample the shots from the MMLU dev set, which has five examples per subject. ... We fine-tune the models with LoRA [48] on a data set formed by the union of the dev set and some question-answer pairs of the validation set to reach an accuracy comparable to the 5-shot one. ... The dataset on which we fine-tune the models is the union of the MMLU dev set (all five examples per subject are selected) and a subset of the MMLU validation set."

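A hedged sketch of the data handling quoted above: loading MMLU from the Hugging Face hub, drawing the five dev-set shots for a given subject, and assembling a fine-tuning pool from the dev split plus part of the validation split. The dataset configuration name and the size of the validation subset are assumptions, not the paper's exact choices.

    # Illustrative data handling for MMLU few-shot prompting and fine-tuning splits.
    from datasets import load_dataset, concatenate_datasets

    mmlu = load_dataset("cais/mmlu", "all")  # assumption: "all" config with dev/validation/test splits

    # Few-shot prompting: the dev split has exactly five examples per subject.
    def five_shot_examples(subject):
        dev = mmlu["dev"].filter(lambda ex: ex["subject"] == subject)
        return list(dev)

    # Fine-tuning pool: the full dev split plus a subset of validation
    # (the subset size here is a placeholder, not the paper's value).
    val_subset = mmlu["validation"].shuffle(seed=0).select(range(1000))
    finetune_set = concatenate_datasets([mmlu["dev"], val_subset])
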
Hardware Specification: Yes
LLM Response: "We run the experiments on a single Nvidia A100 GPU with a VRAM of 40GB. Extracting the hidden representation of 70 billion parameter models requires 5 A100, and their fine-tuning requires 8 A100."

Software Dependencies: No
LLM Response: The paper mentions specific software packages like 'ADP presented in d'Errico et al. [11] and implemented in the DADApy package [49]' and 'We measure the ID of the hidden representations with the Gride algorithm (see Sec. 3.2)' but does not provide version numbers for these or other software dependencies.

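Although no versions are pinned, the named tools are available in the DADApy package. The sketch below shows how the Gride intrinsic-dimension estimator and ADP clustering are typically invoked; the parameter values (maxk, range_max, Z) and the density estimator are placeholders, not the paper's settings.

    # Illustrative use of DADApy's Gride ID estimator and ADP clustering.
    import numpy as np
    from dadapy import Data

    X = np.random.rand(2000, 4096)  # stand-in for one layer's hidden representations

    data = Data(X)
    data.compute_distances(maxk=100)
    ids, ids_err, scales = data.return_id_scaling_gride(range_max=64)  # ID across scales
    data.compute_density_kstarNN()        # density estimate (one of several DADApy options)
    data.compute_clustering_ADP(Z=1.6)    # Advanced Density Peaks clustering
    print(ids[0], len(np.unique(data.cluster_assignment)))
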
Experiment Setup: Yes
LLM Response: "The LoRA rank is 64, α is 16, and dropout = 0.1. For the 70 billion models, we choose a rank of 128, and α is 32. ... We use a learning rate = 2 × 10^-4; for the 70 billion models we decrease it to 1 × 10^-4. For all the models, we apply a cosine annealing scheduler and a linear warm-up for 5% of the total iterations. We fine-tune all the models with batch size = 16 using the Adam optimizer without weight decay."

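The reported hyperparameters map directly onto a Hugging Face peft/transformers configuration. The sketch below is a non-authoritative reconstruction for the 7B-scale settings (rank 64, α 16, dropout 0.1, lr 2e-4, cosine schedule with 5% warm-up, batch size 16, no weight decay); the target modules and the trainer wiring are omitted because the excerpt does not specify them.

    # Hypothetical LoRA fine-tuning configuration matching the reported hyperparameters.
    from peft import LoraConfig, get_peft_model
    from transformers import TrainingArguments

    lora_config = LoraConfig(
        r=64,                # 128 for the 70B models
        lora_alpha=16,       # 32 for the 70B models
        lora_dropout=0.1,
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        output_dir="mmlu_lora",
        learning_rate=2e-4,            # 1e-4 for the 70B models
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,             # linear warm-up for 5% of iterations
        per_device_train_batch_size=16,
        optim="adamw_torch",
        weight_decay=0.0,              # AdamW with zero decay, matching "Adam without weight decay"
    )

    # model = get_peft_model(base_model, lora_config)  # then train with a Trainer on finetune_set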