TRAM: Bridging Trust Regions and Sharpness Aware Minimization
Authors: Tom Sherborne, Naomi Saphra, Pradeep Dasigi, Hao Peng
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate TRAM in vision (cross-dataset adaptation) and text (OOD language modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer and representation generality are critical. TRAM outperforms SAM- and TR-based optimization across all tasks, notably surpassing competing methods for hard transfer between anticorrelated domains. |
| Researcher Affiliation | Collaboration | Tom Sherborne (University of Edinburgh), Naomi Saphra (Kempner Institute, Harvard University), Pradeep Dasigi (Allen Institute for AI), Hao Peng (University of Illinois Urbana-Champaign) |
| Pseudocode | Yes | Algorithm 1 in Appendix B.6 details the full training algorithm for TRAM based on the SAM-style min-max optimization routine. ... Algorithm 2 details the TRAM-Fisher algorithm. (A generic SAM-style update sketch follows the table.) |
| Open Source Code | Yes | Code at github.com/tomsherborne/tram_optimizer. |
| Open Datasets | Yes | For vision modality experiments, we evaluate cross-dataset transfer from ImageNet (Deng et al., 2009) to CIFAR-100 (Krizhevsky, 2009), Stanford Cars (Krause et al., 2013), and Oxford Flowers (Nilsback & Zisserman, 2008). We source all datasets from Hugging Face using the default training/testing partitions. ... We evaluate the M2D2 dataset (Reid et al., 2022) for cross-domain language modeling. (A data-loading sketch follows the table.) |
| Dataset Splits | Yes | When using validation loss for model selection, we use only the validation partition of the training domain to reflect a stricter evaluation setup without access to additional domains during training. ... Table 7 details the partition sizes (in tokens) for each domain in M2D2. ... Table 8: Data splits for XNLI (Conneau et al., 2018), e.g. English (EN): 393K train / 2.5K validation / 5K test. |
| Hardware Specification | Yes | All models are trained on 1 A100 80GB GPU for under 72 hours, except for the GPT2-XL experiments in Appendix C.2. ... but we use 4 A100 GPUs for training, each with a per-device batch size of 4 blocks of 1024 tokens. |
| Software Dependencies | No | The paper mentions optimizers like Adam and uses specific models like GPT-2 and XLM-Roberta, but it does not list specific software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.x'). |
| Experiment Setup | Yes | For language tasks, we fine-tune each pre-trained model for 50,000 steps using an initial learning rate of 2 × 10⁻⁵, a polynomial decay schedule, and a 10,000-step learning rate warmup. We use Adam (Kingma & Ba, 2017), with a decay factor setting (β1, β2) = (0.9, 0.99), as the base optimizer for each SAM-style and TR method unless mentioned otherwise. ... We match the experimental setting of Kim et al. (2022): fine-tuning ViT-base-16 for 200 epochs with a base optimizer of SGD, an initial learning rate of 5 × 10⁻⁴, and a cosine learning rate decay with no warmup or restarts. (An optimizer and schedule sketch follows the table.) |
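
The Pseudocode row references Algorithm 1, a SAM-style min-max routine. As a rough illustration only, the sketch below shows a generic SAM-style update step in PyTorch with a fixed perturbation radius `rho`; TRAM itself derives the ascent step from trust-region information, so this is not the authors' implementation (see Algorithm 1 in Appendix B.6 and the linked repository for that).

```python
# Generic SAM-style min-max step (illustrative sketch only; the actual TRAM
# algorithm is given in Algorithm 1 of the paper and in the authors' repo at
# github.com/tomsherborne/tram_optimizer).
import torch

def sam_style_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One SAM-style update: ascend to a nearby high-loss point, then
    descend at the original weights using the perturbed gradient."""
    # 1) Gradient at the current weights.
    loss = loss_fn(model, batch)
    loss.backward()

    # 2) Ascent step: perturb weights along the normalized gradient.
    #    (Plain SAM uses a fixed rho; TRAM instead informs this step with
    #    trust-region quantities.)
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = p.grad * scale
            p.add_(e)
            perturbations.append((p, e))
    model.zero_grad()

    # 3) Gradient at the perturbed (high-sharpness) weights.
    loss_fn(model, batch).backward()

    # 4) Undo the perturbation and take the base optimizer step with the
    #    sharpness-aware gradient.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```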
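
For the Open Datasets row, the paper states that the vision datasets are sourced from Hugging Face with their default training/testing partitions. A minimal sketch of that loading pattern, assuming the `datasets` library, is below; the hub identifier `cifar100` is an assumption, and the exact dataset names the authors used may differ.

```python
# Minimal sketch of sourcing a vision dataset from Hugging Face with its
# default train/test partitions; "cifar100" is an assumed hub identifier and
# the other datasets (Stanford Cars, Oxford Flowers) would load analogously.
from datasets import load_dataset

cifar100 = load_dataset("cifar100")
train_split, test_split = cifar100["train"], cifar100["test"]
print(len(train_split), len(test_split))  # default partition sizes
```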
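
For the Experiment Setup row, the language-task configuration (Adam with (β1, β2) = (0.9, 0.99), an initial learning rate of 2 × 10⁻⁵, polynomial decay, and 10,000 warmup steps over 50,000 total steps) could be approximated as in the sketch below, assuming PyTorch and Hugging Face `transformers`; GPT-2 is used here only as an example model from the paper's language-modeling experiments, and this is not the authors' training script.

```python
# Rough sketch of the language-task fine-tuning setup (assumes PyTorch and
# Hugging Face transformers); not the authors' training script.
import torch
from transformers import AutoModelForCausalLM, get_polynomial_decay_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")  # example LM from the paper

# Adam base optimizer with the decay factors reported in the paper.
base_optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.99))

# Polynomial decay with 10,000 warmup steps over 50,000 total steps.
scheduler = get_polynomial_decay_schedule_with_warmup(
    base_optimizer,
    num_warmup_steps=10_000,
    num_training_steps=50_000,
)

# In SAM-style training this Adam instance is the base optimizer wrapped by
# the sharpness-aware (or TRAM) outer min-max loop; scheduler.step() is
# called once per training step after base_optimizer.step().
```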