Efficient Sketches for Training Data Attribution and Studying the Loss Landscape
Authors: Andrea Schioppa
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To thoroughly evaluate the proposed sketching methods, we present a comprehensive set of experiments. First, we highlight the limitations of existing TDA scaling strategies (Sec. 5.2). Next, we dissect the impact of specific design choices on our sketches (Sec. 5.3). We then introduce and validate an algorithm for intrinsic dimension estimation, enabling computational savings (Sec. 5.4) and showcasing that the intrinsic dimensionality of generative tasks can be large. Finally, we apply our techniques to explore the evolution of the Hessian spectrum during pre-trained language model fine-tuning (Sec. 5.5). |
| Researcher Affiliation | Industry | Andrea Schioppa, Google DeepMind, Amsterdam, the Netherlands, arischioppa@google.com |
| Pseudocode | Yes | Listing 9: An algorithm that searches the intrinsic dimension |
| Open Source Code | Yes | Python code to implement the proposed algorithms (in Jax) is provided in Appendix B. See the illustrative JAX sketch after this table. |
| Open Datasets | Yes | We adopt the setup of [6]: a generative task fine-tuning GPT-2 on the WikiText-103 dataset (BART and zsRE results in Appendix A). Our experiments evaluate the efficiency and accuracy of our intrinsic dimension estimation algorithm (presented in Sec. 4). We consider two experimental setups: classification, where we fine-tune RoBERTa on SNLI with accuracy as the target metric; generation, where we fine-tune BART on XSUM for text summarization, using ROUGE-1 and ROUGE-2 for evaluation. |
| Dataset Splits | No | The paper mentions fine-tuning and evaluation on datasets such as WikiText-103, SNLI, and XSUM, but does not explicitly state the training/validation/test splits (e.g., percentages or sample counts) used in these experiments. |
| Hardware Specification | Yes | Benchmark table columns: ALGO, GPU (V100), TPU (V2), T (ms), M (GB) ... GPU is V100, TPU is TPUv2. ... Experiments in Sec. 5.4 used 2 V100s in the classification setting and 2 A100s in the generation setting. Experiments in Sec. 5.5 used 2 A100s. |
| Software Dependencies | No | The paper states that it uses Jax and Hugging Face libraries, and that experiments in Sec. 5.2 were carried out on one V100 GPU or a TPUv2 (8 cores), but it does not specify library versions. |
| Experiment Setup | Yes | Appendix B.6, B.7, and B.8 provide detailed hyper-parameters for the experiments. For example: 'RoBERTa was fine-tuned with a batch size of 32 for 10k steps with Adam and a constant learning rate of 2×10⁻⁵. For the search algorithm (Listing 9) the learning rate was increased to 10⁻⁴, δ = 0.1 and c = 2k steps.' An illustrative search-loop sketch follows the table. |
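
The paper's core object is a low-dimensional sketch of per-example gradients, with the authors' JAX implementation given in Appendix B of the paper. As a purely illustrative stand-in, the snippet below sketches a flattened gradient with a dense Johnson-Lindenstrauss random projection; the toy model, the function names, and the dense projection itself are assumptions here, not the paper's memory-efficient sketching algorithms.

```python
# Hedged illustration: sketching a gradient with a dense Gaussian projection.
# This is a baseline sketch, NOT the paper's efficient sketches; the toy model
# and helper names below are assumptions made for the sake of a runnable example.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree


def sketch_gradient(loss_fn, params, batch, key, sketch_dim):
    """Project a flattened gradient to `sketch_dim` dimensions."""
    grads = jax.grad(loss_fn)(params, batch)
    flat_grad, _ = ravel_pytree(grads)                # shape (num_params,)
    num_params = flat_grad.shape[0]
    # Dense num_params x sketch_dim projection: fine for a toy model, but
    # prohibitive at the model sizes the paper targets, which is exactly the
    # memory/throughput problem its efficient sketches address.
    proj = jax.random.normal(key, (num_params, sketch_dim)) / jnp.sqrt(sketch_dim)
    return flat_grad @ proj                           # shape (sketch_dim,)


# Toy usage: a linear model with a squared loss (names are illustrative).
def squared_loss(params, batch):
    x, y = batch
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


params = {"w": jnp.ones((16, 1)), "b": jnp.zeros((1,))}
batch = (jnp.ones((4, 16)), jnp.zeros((4, 1)))
sketch = sketch_gradient(squared_loss, params, batch, jax.random.PRNGKey(0), sketch_dim=8)
print(sketch.shape)   # (8,)
```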
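
The intrinsic-dimension search (Listing 9, with δ = 0.1 and c = 2k steps in the quoted setup) is not reproduced in the excerpts above. The sketch below illustrates one plausible reading in the spirit of random-subspace training (θ = P·z): optimize in a random d-dimensional subspace for a fixed number of steps and double d until the achieved loss falls within δ of a baseline. The toy objective, the doubling schedule, and all inner-loop hyper-parameters are assumptions, not the paper's algorithm.

```python
# Hedged illustration of an intrinsic-dimension search via random-subspace
# training. This is NOT the paper's Listing 9: the toy least-squares objective,
# the doubling schedule, and the inner gradient-descent loop are assumptions.
import jax
import jax.numpy as jnp

D, K = 512, 8                                    # full dim, effective rank (toy)
key_m, key_b = jax.random.split(jax.random.PRNGKey(0))
M = jax.random.normal(key_m, (K, D)) / jnp.sqrt(D)
b = jax.random.normal(key_b, (K,))


def loss(theta):
    # Toy stand-in for the fine-tuning loss; its minimum (zero) is reachable
    # from a random subspace of dimension roughly K.
    return 0.5 * jnp.sum((M @ theta - b) ** 2)


def loss_after_subspace_training(d, steps=500, lr=0.05, seed=1):
    """Train z in a random d-dimensional subspace (theta = P @ z), return the loss."""
    P = jax.random.normal(jax.random.PRNGKey(seed), (D, d)) / jnp.sqrt(d)
    grad_fn = jax.jit(jax.grad(lambda z: loss(P @ z)))
    z = jnp.zeros((d,))
    for _ in range(steps):
        z = z - lr * grad_fn(z)
    return float(loss(P @ z))


baseline = loss_after_subspace_training(D)        # d = D: essentially unconstrained
delta, d = 0.1, 2                                 # tolerance, starting dimension
while loss_after_subspace_training(d) > baseline + delta:
    d *= 2                                        # grow the subspace and retry
print("estimated intrinsic dimension:", d)
```

In the setting quoted in the table, the check would compare a task metric (accuracy on SNLI or ROUGE on XSUM) after c fine-tuning steps against full fine-tuning, rather than a toy loss against a toy baseline.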