Tangent Transformers for Composition, Privacy and Removal

Authors: Tian Yu Liu, Aditya Golatkar, Stefano Soatto

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Tangent Attention Fine-Tuning (TAFT), a method for fine-tuning linearized transformers obtained by computing a First-order Taylor Expansion around a pre-trained initialization... Furthermore, we show that, when applied to various downstream visual classification tasks, the resulting Tangent Transformer fine-tuned with TAFT can perform comparably with fine-tuning the original non-linear network. In Sec. 4.2, we show that TAFT on Tangent Transformers can attain similar performances on downstream tasks compared to non-linear fine-tuning. We show the advantages that arise from linearity for composition and parallel training in Sec. 4.3, machine unlearning in Sec. 4.4, and privacy in Sec. 4.5. We describe our implementation details in Sec. 4.1, and carry out ablation studies on our implementation choices in Sec. 4.6. (A tangent-model sketch follows the table.)
Researcher Affiliation | Academia | Tian Yu Liu, University of California, Los Angeles (tianyu@cs.ucla.edu); Aditya Golatkar, University of California, Los Angeles (adityagolatkar@ucla.edu); Stefano Soatto, University of California, Los Angeles (soatto@cs.ucla.edu)
Pseudocode | No | The paper describes mathematical derivations and processes (e.g., linearizing attention, layer norm, fully-connected layers, non-linearities) but does not present them in pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Our code is available at: https://github.com/tianyu139/tangent-model-composition
Open Datasets | Yes | We evaluate on the following datasets, in increasing order of distance from the ImageNet pretraining task based on Li et al. (2020): Caltech-256 (Griffin et al., 2007), MIT-67 (Quattoni & Torralba, 2009), Oxford Pets (Parkhi et al., 2012), Stanford Dogs (Khosla et al., 2011), CUB-200 (Wah et al., 2011), FGVC-Aircrafts (Maji et al., 2013), and Stanford Cars (Krause et al., 2013).
Dataset Splits | No | For experiments on composition and machine unlearning, datasets are split into multiple shards with respect to a fixed random seed by uniform sampling without replacement. The paper does not provide explicit training, validation, or test split percentages or sample counts for any of the datasets used. (A sharding sketch follows the table.)
Hardware Specification | No | The paper does not specify the exact hardware used for experiments, such as specific GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types. It only states generally that 'We run all our experiments with ViT-L/16...' and 'Timing is computed using the MIT-67 dataset.'
Software Dependencies | No | The paper mentions using the 'Adam optimizer' and conducts experiments with 'ViT-L/16' models, which implies the use of deep learning frameworks like PyTorch or TensorFlow, but it does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We run all our experiments on Vision Transformers on image classification tasks. In particular, we use ViT-L/16 (Dosovitskiy et al., 2020) as the base model in all our experiments, and linearize around its ImageNet pre-trained weights, the result of which we call T-ViT-L/16. For experiments using TAFT in Tables 1 and 2, and Figures 1(a)-(c), we train with the RSL loss using κ = 15. We also adopt a 30-epoch learning schedule for each dataset/task, with learning rate decay by a factor of 10 at epochs 15 and 25. We use a batch size of 32 for all our experiments, and train using the Adam optimizer. We search over learning rates (LR) of {0.001, 0.0001} for both non-linear fine-tuning and TAFT. (A training-setup sketch follows the table.)
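
The Research Type row above summarizes TAFT: fine-tune a linearized (tangent) transformer obtained from a first-order Taylor expansion of the network around its pre-trained weights. Below is a minimal sketch of that expansion using torch.func; it is not the authors' implementation, and the use of torchvision's vit_l_16, the weights argument, and the handling of the tangent parameters delta are illustrative assumptions.

```python
# Sketch only: a first-order Taylor (tangent) model of the kind TAFT fine-tunes.
# Assumptions: torchvision's vit_l_16 stands in for the paper's ViT-L/16, and
# weights=None keeps the sketch light (the paper expands around ImageNet weights).
import torch
from torch.func import functional_call, jvp
from torchvision.models import vit_l_16

def tangent_forward(model, params0, delta, x):
    """Return f(x; params0) + J_theta f(x; params0) @ delta."""
    def f(params):
        return functional_call(model, params, (x,))
    out, jvp_out = jvp(f, (params0,), (delta,))
    return out + jvp_out

model = vit_l_16(weights=None).eval()
params0 = {k: v.detach() for k, v in model.named_parameters()}  # frozen expansion point
delta = {k: torch.zeros_like(v, requires_grad=True) for k, v in params0.items()}  # trainable tangent parameters

x = torch.randn(1, 3, 224, 224)
logits = tangent_forward(model, params0, delta, x)  # output is linear in delta
```

Because the output is linear in the trainable tangent parameters, tangent models trained separately can be combined linearly, which is the kind of property the excerpt credits for composition, parallel training, and unlearning.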
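
The Dataset Splits row notes that, for composition and unlearning experiments, each dataset is split into multiple shards by uniform sampling without replacement under a fixed random seed. A minimal sketch of such sharding follows; the shard count, seed, and sample count are illustrative, not values from the paper.

```python
# Sketch only: split a dataset's sample indices into disjoint shards with a fixed seed.
import numpy as np

def make_shards(num_samples, num_shards, seed=42):
    rng = np.random.default_rng(seed)          # fixed random seed
    perm = rng.permutation(num_samples)        # uniform sampling without replacement
    return np.array_split(perm, num_shards)    # near-equal, disjoint index shards

shards = make_shards(num_samples=10_000, num_shards=10, seed=42)
assert sum(len(s) for s in shards) == 10_000   # every sample lands in exactly one shard
```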
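
The Experiment Setup row reports Adam, batch size 32, and a 30-epoch schedule with the learning rate decayed by a factor of 10 at epochs 15 and 25, with a search over learning rates {0.001, 0.0001}. A hedged sketch of that optimization loop follows; the cross-entropy loss is a placeholder (the paper trains with its RSL loss, κ = 15, which is not reproduced here), and the data loading is assumed.

```python
# Sketch only: the reported optimizer and schedule, with a placeholder loss.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

def train(model, train_loader, lr, epochs=30):
    opt = Adam(model.parameters(), lr=lr)
    sched = MultiStepLR(opt, milestones=[15, 25], gamma=0.1)  # decay by 10x at epochs 15 and 25
    loss_fn = torch.nn.CrossEntropyLoss()  # placeholder; the paper uses its RSL loss (kappa = 15)
    for _ in range(epochs):
        for images, labels in train_loader:  # DataLoader assumed to use batch_size=32
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
        sched.step()
    return model

# Reported learning-rate search, applied to both non-linear fine-tuning and TAFT:
# for lr in (0.001, 0.0001):
#     train(model, train_loader, lr)
```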