TRAM: Bridging Trust Regions and Sharpness Aware Minimization

Authors: Tom Sherborne, Naomi Saphra, Pradeep Dasigi, Hao Peng

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate TRAM in vision (cross-dataset adaptation) and text (OOD language modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer and representation generality are critical. TRAM outperforms SAM- and TR-based optimization across all tasks, notably surpassing competing methods for hard transfer between anticorrelated domains.
Researcher Affiliation | Collaboration | Tom Sherborne (University of Edinburgh), Naomi Saphra (Kempner Institute, Harvard University), Pradeep Dasigi (Allen Institute for AI), Hao Peng (University of Illinois Urbana-Champaign)
Pseudocode | Yes | Algorithm 1 in Appendix B.6 details the full training algorithm for TRAM based on the SAM-style min-max optimization routine. ... Algorithm 2 details the TRAM-Fisher algorithm. (A generic SAM-style update is sketched after the table.)
Open Source Code | Yes | Code at github.com/tomsherborne/tram_optimizer.
Open Datasets | Yes | For vision modality experiments, we evaluate cross-dataset transfer from ImageNet (Deng et al., 2009) to CIFAR-100 (Krizhevsky, 2009), Stanford Cars (Krause et al., 2013), and Oxford Flowers (Nilsback & Zisserman, 2008). We source all datasets from Hugging Face using the default training/testing partitions. ... We evaluate the M2D2 dataset (Reid et al., 2022) for cross-domain language modeling. (Illustrative loading code follows the table.)
Dataset Splits | Yes | When using validation loss for model selection, we use only the validation partition of the training domain to reflect a stricter evaluation setup without access to additional domains during training. ... Table 7 details the partition sizes (in tokens) for each domain in M2D2. ... Table 8: Data splits for XNLI (Conneau et al., 2018), e.g., English (EN): 393K train / 2.5K validation / 5K test.
Hardware Specification | Yes | All models are trained on 1 A100 80GB GPU for under 72 hours, except for the GPT2-XL experiments in Appendix C.2. ... but we use 4 A100 GPUs for training, each with a per-device batch size of 4 blocks of 1024 tokens.
Software Dependencies | No | The paper mentions optimizers like Adam and uses specific models like GPT-2 and XLM-Roberta, but it does not list specific software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.x').
Experiment Setup | Yes | For language tasks, we fine-tune each pre-trained model for 50,000 steps using an initial learning rate of 2 × 10⁻⁵, a polynomial decay schedule, and a 10,000-step learning rate warmup. We use Adam (Kingma & Ba, 2017), with a decay factor setting (β1, β2) = (0.9, 0.99), as the base optimizer for each SAM-style and TR method unless mentioned otherwise. ... We match the experimental setting of Kim et al. (2022): fine-tuning ViT-base-16 for 200 epochs with a base optimizer of SGD, an initial learning rate of 5 × 10⁻⁴, and a cosine learning rate decay with no warmup or restarts. (A sketch of this optimizer/schedule setup follows the table.)
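
For readers unfamiliar with the SAM-style min-max routine that Algorithm 1 builds on (see the Pseudocode row), the following is a minimal PyTorch sketch of a generic two-pass SAM update: an ascent step to a perturbed point inside a fixed-radius neighborhood, then a descent step using the gradients computed there. This is not the authors' TRAM algorithm; TRAM's trust-region-informed neighborhood is replaced here by a fixed radius `rho`, and the `loss_fn(model, batch)` interface is an assumption for illustration.

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One generic SAM-style min-max update (illustrative sketch, not TRAM itself).

    1) Ascent: perturb parameters toward higher loss within an L2 ball of radius rho.
    2) Descent: compute gradients at the perturbed point, restore the parameters,
       and apply the base optimizer (e.g., Adam or SGD) with those gradients.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    base_optimizer.zero_grad()

    # First forward/backward pass: gradients at the current point.
    loss = loss_fn(model, batch)
    loss.backward()

    # Scale the gradient to the boundary of the rho-ball and move to w + e(w).
    grad_norm = torch.norm(
        torch.stack([p.grad.norm(p=2) for p in params if p.grad is not None]), p=2
    )
    scale = rho / (grad_norm + 1e-12)
    perturbations = []
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                perturbations.append(None)
                continue
            e = p.grad * scale
            p.add_(e)  # step to the (approximate) worst-case point in the neighborhood
            perturbations.append(e)
    base_optimizer.zero_grad()

    # Second forward/backward pass: gradients at the perturbed point.
    loss_fn(model, batch).backward()

    # Undo the perturbation, then take the base optimizer step with the new gradients.
    with torch.no_grad():
        for p, e in zip(params, perturbations):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```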
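
The Open Datasets row reports that all datasets are sourced from Hugging Face with the default training/testing partitions. Below is a hedged sketch of how such loading might look with the `datasets` library; the hub identifiers for Stanford Cars, Oxford Flowers, and M2D2 are not given in the excerpt, so the commented-out IDs are assumptions rather than the authors' exact sources.

```python
from datasets import load_dataset

# Vision transfer target with standard train/test partitions.
cifar100 = load_dataset("cifar100")
# stanford_cars = load_dataset("...")   # hub ID not stated in the excerpt
# oxford_flowers = load_dataset("...")  # hub ID not stated in the excerpt
# m2d2 = load_dataset("...")            # hub ID not stated in the excerpt

# Zero-shot cross-lingual transfer data: XNLI with train/validation/test splits.
xnli_en = load_dataset("xnli", "en")

print(cifar100)                    # inspect the default partitions
print(xnli_en["train"].num_rows)   # e.g., the English training split size
```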
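
As a concrete reading of the language-task setup in the Experiment Setup row (50,000 steps, initial learning rate 2 × 10⁻⁵, polynomial decay, 10,000 warmup steps, Adam with (β1, β2) = (0.9, 0.99)), here is a sketch using standard PyTorch and Hugging Face `transformers` utilities. The model choice (`gpt2`) and the omitted batch/loss handling are placeholders, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, get_polynomial_decay_schedule_with_warmup

# Placeholder pre-trained LM; the paper fine-tunes models such as GPT-2.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Base optimizer as reported: Adam with lr = 2e-5 and (beta1, beta2) = (0.9, 0.99).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.99))

# Polynomial decay with a 10,000-step warmup over 50,000 total training steps.
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=50_000
)

# Training loop outline (batch fetching and loss computation omitted).
for step in range(50_000):
    # loss = compute_lm_loss(model, next_batch())  # placeholder, not from the paper
    # loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```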