TRAM: Bridging Trust Regions and Sharpness Aware Minimization

Authors: Tom Sherborne, Naomi Saphra, Pradeep Dasigi, Hao Peng

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate TRAM in vision (cross-dataset adaptation) and text (OOD language modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer and representation generality are critical. TRAM outperforms SAM- and TR-based optimization across all tasks, notably surpassing competing methods for hard transfer between anticorrelated domains.
Researcher Affiliation | Collaboration | Tom Sherborne (University of Edinburgh), Naomi Saphra (Kempner Institute, Harvard University), Pradeep Dasigi (Allen Institute for AI), Hao Peng (University of Illinois Urbana-Champaign)
Pseudocode | Yes | Algorithm 1 in Appendix B.6 details the full training algorithm for TRAM based on the SAM-style min-max optimization routine. ... Algorithm 2 details the TRAM-Fisher algorithm. (A generic SAM-style update is sketched after the table.)
Open Source Code | Yes | Code at github.com/tomsherborne/tram_optimizer.
Open Datasets | Yes | For vision modality experiments, we evaluate cross-dataset transfer from ImageNet (Deng et al., 2009) to CIFAR-100 (Krizhevsky, 2009), Stanford Cars (Krause et al., 2013), and Oxford Flowers (Nilsback & Zisserman, 2008). We source all datasets from Hugging Face using the default training/testing partitions. ... We evaluate the M2D2 dataset (Reid et al., 2022) for cross-domain language modeling. (Illustrative loading code follows the table.)
Dataset Splits | Yes | When using validation loss for model selection, we use only the validation partition of the training domain to reflect a stricter evaluation setup without access to additional domains during training. ... Table 7 details the partition sizes (in tokens) for each domain in M2D2. ... Table 8: Data splits for XNLI (Conneau et al., 2018), e.g., English (EN): 393K train / 2.5K validation / 5K test.
Hardware Specification | Yes | All models are trained on 1 A100 80GB GPU for under 72 hours, except for the GPT2-XL experiments in Appendix C.2. ... but we use 4 A100 GPUs for training, each with a per-device batch size of 4 blocks of 1024 tokens.
Software Dependencies | No | The paper mentions optimizers like Adam and uses specific models like GPT-2 and XLM-Roberta, but it does not list specific software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.x').
Experiment Setup | Yes | For language tasks, we fine-tune each pre-trained model for 50,000 steps using an initial learning rate of 2 × 10⁻⁵, a polynomial decay schedule, and a 10,000-step learning rate warmup. We use Adam (Kingma & Ba, 2017), with a decay factor setting (β1, β2) = (0.9, 0.99), as the base optimizer for each SAM-style and TR method unless mentioned otherwise. ... We match the experimental setting of Kim et al. (2022): fine-tuning ViT-base-16 for 200 epochs with a base optimizer of SGD, an initial learning rate of 5 × 10⁻⁴, and a cosine learning rate decay with no warmup or restarts. (A sketch of this optimizer/schedule setup follows the table.)
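
For readers unfamiliar with the SAM-style min-max routine that Algorithm 1 builds on (see the Pseudocode row), the following is a minimal PyTorch sketch of a generic two-pass SAM update: an ascent step to a perturbed point inside a fixed-radius neighborhood, then a descent step using the gradients computed there. This is not the authors' TRAM algorithm; TRAM's trust-region-informed neighborhood is replaced here by a fixed radius `rho`, and the `loss_fn(model, batch)` interface is an assumption for illustration.

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One generic SAM-style min-max update (illustrative sketch, not TRAM itself).

    1) Ascent: perturb parameters toward higher loss within an L2 ball of radius rho.
    2) Descent: compute gradients at the perturbed point, restore the parameters,
       and apply the base optimizer (e.g., Adam or SGD) with those gradients.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    base_optimizer.zero_grad()

    # First forward/backward pass: gradients at the current point.
    loss = loss_fn(model, batch)
    loss.backward()

    # Scale the gradient to the boundary of the rho-ball and move to w + e(w).
    grad_norm = torch.norm(
        torch.stack([p.grad.norm(p=2) for p in params if p.grad is not None]), p=2
    )
    scale = rho / (grad_norm + 1e-12)
    perturbations = []
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                perturbations.append(None)
                continue
            e = p.grad * scale
            p.add_(e)  # step to the (approximate) worst-case point in the neighborhood
            perturbations.append(e)
    base_optimizer.zero_grad()

    # Second forward/backward pass: gradients at the perturbed point.
    loss_fn(model, batch).backward()

    # Undo the perturbation, then take the base optimizer step with the new gradients.
    with torch.no_grad():
        for p, e in zip(params, perturbations):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```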
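
The Open Datasets row reports that all datasets are sourced from Hugging Face with the default training/testing partitions. Below is a hedged sketch of how such loading might look with the `datasets` library; the hub identifiers for Stanford Cars, Oxford Flowers, and M2D2 are not given in the excerpt, so the commented-out IDs are assumptions rather than the authors' exact sources.

```python
from datasets import load_dataset

# Vision transfer target with standard train/test partitions.
cifar100 = load_dataset("cifar100")
# stanford_cars = load_dataset("...")   # hub ID not stated in the excerpt
# oxford_flowers = load_dataset("...")  # hub ID not stated in the excerpt
# m2d2 = load_dataset("...")            # hub ID not stated in the excerpt

# Zero-shot cross-lingual transfer data: XNLI with train/validation/test splits.
xnli_en = load_dataset("xnli", "en")

print(cifar100)                    # inspect the default partitions
print(xnli_en["train"].num_rows)   # e.g., the English training split size
```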
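
As a concrete reading of the language-task setup in the Experiment Setup row (50,000 steps, initial learning rate 2 × 10⁻⁵, polynomial decay, 10,000 warmup steps, Adam with (β1, β2) = (0.9, 0.99)), here is a sketch using standard PyTorch and Hugging Face `transformers` utilities. The model choice (`gpt2`) and the omitted batch/loss handling are placeholders, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, get_polynomial_decay_schedule_with_warmup

# Placeholder pre-trained LM; the paper fine-tunes models such as GPT-2.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Base optimizer as reported: Adam with lr = 2e-5 and (beta1, beta2) = (0.9, 0.99).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.99))

# Polynomial decay with a 10,000-step warmup over 50,000 total training steps.
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=50_000
)

# Training loop outline (batch fetching and loss computation omitted).
for step in range(50_000):
    # loss = compute_lm_loss(model, next_batch())  # placeholder, not from the paper
    # loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```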