Efficient Sketches for Training Data Attribution and Studying the Loss Landscape

Authors: Andrea Schioppa

NeurIPS 2024

Reproducibility Variables (each listed with its result and the supporting LLM response):
Research Type: Experimental
To thoroughly evaluate the proposed sketching methods, we present a comprehensive set of experiments. First, we highlight the limitations of existing TDA scaling strategies (Sec. 5.2). Next, we dissect the impact of specific design choices on our sketches (Sec. 5.3). We then introduce and validate an algorithm for intrinsic dimension estimation, enabling computational savings (Sec. 5.4) and showcasing that the intrinsic dimensionality of generative tasks can be large. Finally, we apply our techniques to explore the evolution of the Hessian spectrum during pre-trained language model fine-tuning (Sec. 5.5).
Researcher Affiliation: Industry
Andrea Schioppa, Google DeepMind, Amsterdam, the Netherlands, arischioppa@google.com
Pseudocode: Yes
Listing 9: An algorithm that searches the intrinsic dimension
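The listing itself appears in the paper's appendix. As a rough, hedged illustration of what a search of this kind can look like (not the paper's Listing 9), the sketch below runs a doubling-then-bisection search over candidate subspace dimensions; the `train_and_evaluate` callable, the tolerance handling, and the stopping rule are all assumptions introduced here.

```python
# Hedged sketch of an intrinsic-dimension search, assuming a monotone
# "train in a random d-dimensional subspace and report the metric" callable.
# This is an illustration, not the paper's Listing 9.

from typing import Callable

def search_intrinsic_dimension(
    train_and_evaluate: Callable[[int], float],  # hypothetical subspace training loop
    full_metric: float,       # metric of unconstrained fine-tuning
    delta: float = 0.1,       # acceptable relative gap to full_metric
    d_init: int = 1_000,      # starting dimension
    d_max: int = 10_000_000,  # upper bound on the search
) -> int:
    def good(d: int) -> bool:
        # A dimension "suffices" if subspace training comes within a relative
        # tolerance delta of the unconstrained fine-tuning metric.
        return train_and_evaluate(d) >= (1.0 - delta) * full_metric

    # Phase 1: double the candidate dimension until one suffices.
    d = d_init
    while not good(d):
        if d >= d_max:
            raise ValueError("no sufficient dimension found below d_max")
        d = min(2 * d, d_max)

    # Phase 2: bisect between the last failing and first sufficient dimension,
    # stopping at roughly 10% resolution to limit retraining runs.
    lo, hi = d // 2, d
    while hi - lo > max(1, lo // 10):
        mid = (lo + hi) // 2
        if good(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Toy usage with a fake monotone metric standing in for real subspace training.
fake = lambda d: min(1.0, d / 50_000)
print(search_intrinsic_dimension(fake, full_metric=1.0, delta=0.1))
```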
Open Source Code: Yes
Python code to implement the proposed algorithms (in JAX) is provided in Appendix B.
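The appendix code is not reproduced here. As a minimal, hedged illustration of the general idea of sketching a high-dimensional gradient into a low-dimensional vector with JAX, the snippet below uses a dense Gaussian random projection; the loss function, shapes, and the dense projection matrix are illustrative assumptions, and the paper's Appendix B implements more memory-efficient sketches than this baseline.

```python
# Hedged sketch: dense random-projection gradient sketch in JAX (baseline
# illustration only; not the paper's Appendix B code).

import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def loss_fn(params, batch):
    # Hypothetical stand-in for a model's loss; replace with the real one.
    x, y = batch
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

def sketch_gradient(params, batch, key, sketch_dim=256):
    """Project the flattened gradient into sketch_dim dimensions."""
    grads = jax.grad(loss_fn)(params, batch)
    flat, _ = ravel_pytree(grads)
    # Dense Gaussian projection; memory-heavy for large models, fine for a demo.
    proj = jax.random.normal(key, (sketch_dim, flat.shape[0])) / jnp.sqrt(sketch_dim)
    return proj @ flat

key = jax.random.PRNGKey(0)
params = {"w": jnp.ones((8, 1)), "b": jnp.zeros((1,))}
batch = (jnp.ones((4, 8)), jnp.zeros((4, 1)))
print(sketch_gradient(params, batch, key).shape)  # (256,)
```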
Open Datasets: Yes
We adopt the setup of [6]: a generative task fine-tuning GPT-2 on the WikiText-103 dataset (BART and zsRE results in Appendix A). Our experiments evaluate the efficiency and accuracy of our intrinsic dimension estimation algorithm (presented in Sec. 4). We consider two experimental setups: classification, where we fine-tune RoBERTa on SNLI with accuracy as the target metric; generation, where we fine-tune BART on XSUM for text summarization, using ROUGE-1 and ROUGE-2 for evaluation.
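For reference, the named datasets are publicly available through the Hugging Face `datasets` library. The loader calls below are an assumption about how one could fetch them; the paper does not specify configurations or splits, the identifiers may need adjusting to current Hub names, and the zsRE data is omitted here.

```python
# Hedged sketch: fetching the public datasets named above; config names are
# assumptions, not taken from the paper.

from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")  # GPT-2 generative task
snli = load_dataset("snli")                                 # RoBERTa classification
xsum = load_dataset("xsum")                                 # BART summarization

print({split: len(ds) for split, ds in wikitext.items()})
```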
Dataset Splits: No
The paper mentions training and evaluating on datasets like WikiText-103, SNLI, and XSUM, and refers to 'fine-tuning' and 'evaluation', but does not explicitly state the training/validation/test dataset splits (e.g., percentages or sample counts) used for these experiments.
Hardware Specification: Yes
Table header excerpt: ALGO | GPU (V100) | TPU (V2) | T (ms) | M (GB) ... GPU is V100, TPU is TPUv2. ... Experiments in Sec. 5.4 used 2 V100s in the classification setting and 2 A100s in the generation setting. Experiments in Sec. 5.5 used 2 A100s.
Software Dependencies: No
We use JAX and Hugging Face libraries; experiments in Sec. 5.2 were carried out using one GPU V100 or a TPUv2 (8 cores).
Experiment Setup: Yes
Appendix B.6, B.7, and B.8 provide detailed hyper-parameters for the experiments. For example: 'RoBERTa was fine-tuned with a batch size of 32 for 10k steps with Adam and a constant learning rate of 2 × 10^-5. For the search algorithm (Listing 9) the learning rate was increased to 10^-4, δ = 0.1 and c = 2k steps.'
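Collected for convenience, a hedged sketch of the quoted hyper-parameters as a plain config plus an Adam optimizer; only the numeric values come from the quote, while the dictionary names, the structure, and the use of Optax are illustrative assumptions.

```python
# Hedged sketch: quoted fine-tuning and search hyper-parameters gathered into
# illustrative configs; Optax is assumed only because the paper uses JAX.

import optax

finetune_config = {
    "batch_size": 32,
    "num_steps": 10_000,
    "optimizer": "adam",
    "learning_rate": 2e-5,        # constant schedule
}

search_config = {
    "learning_rate": 1e-4,        # increased for the dimension search
    "delta": 0.1,                 # tolerance δ
    "steps_per_candidate": 2_000, # c = 2k steps
}

optimizer = optax.adam(learning_rate=finetune_config["learning_rate"])
```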