Linear Connectivity Reveals Generalization Strategies

Authors: Jeevesh Juneja, Rachit Bansal, Kyunghyun Cho, João Sedoc, Naomi Saphra

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finetuning 100 QQP models cost around 500 GPU-hours and 48 CoLA models cost around 10 GPU-hours. Each 100×100 interpolation and evaluation consumed 114 GPU-hours; these were performed for both QQP (at three stages during finetuning, for Fig. 7) and MNLI. The 48×48 interpolation for CoLA cost about 28 GPU-hours. The experiments add up to a total cost of approximately 994 GPU-hours on a mix of NVIDIA RTX8000 and V100 nodes. (An arithmetic check of this total appears below the table.)
Researcher Affiliation | Academia | Jeevesh Juneja¹, Rachit Bansal¹, Kyunghyun Cho², João Sedoc², Naomi Saphra²; ¹Delhi Technological University, ²New York University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and models are public. Code: https://github.com/aNOnWhyMooS/connectivity; Models: https://huggingface.co/connectivity. (A sketch for listing and loading the released models appears below the table.)
Open Datasets | Yes | We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID matched validation set. ... Quora Question Pairs (QQP; Wang et al., 2017) is a common paraphrase corpus... We use the PAWS-QQP (Zhang et al., 2019) diagnostic set... The Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) is a set of acceptable and unacceptable English sentences collected from the linguistics literature.
Dataset Splits | Yes | We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID matched validation set. ... The dev and test split of PAWS-QQP. The entire QQP and MNLI validation sets. (A data-loading sketch for these splits appears below the table.)
Hardware Specification | Yes | The experiments add up to a total cost of approximately 994 GPU-hours on a mix of NVIDIA RTX8000 and V100 nodes.
Software Dependencies | No | The paper mentions using specific BERT models and training scripts from Google and Hugging Face, along with the AdamW optimizer, but does not provide explicit version numbers for software dependencies such as deep learning frameworks or libraries.
Experiment Setup | Yes | The QQP models were trained for 3 epochs, with a learning rate of 2×10⁻⁵, a batch size of 32 samples and a weight decay of 0.01 from the bert-base-uncased pre-trained checkpoint using the Google script. ... The CoLA models were trained for 6 epochs with a learning rate of 2×10⁻⁵, a batch size of 32 samples, and no weight decay. This script uses the AdamW (Loshchilov and Hutter, 2017) optimizer too, with a linear learning rate decay schedule but no warm-up. (A minimal finetuning sketch of the QQP recipe appears below the table.)
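
The GPU-hour total in the Research Type row can be checked with simple arithmetic, assuming the 114 GPU-hour figure applies to each of four 100×100 interpolation sweeps (three QQP finetuning stages plus one MNLI sweep). That reading is an interpretation of the quoted wording, not a statement from the paper, but it reproduces the reported total exactly.

```python
# Hedged arithmetic check of the reported compute budget.
qqp_finetuning  = 500        # finetuning 100 QQP models
cola_finetuning = 10         # finetuning 48 CoLA models
interpolations  = 4 * 114    # assumed: 3 QQP stages + 1 MNLI sweep, 100x100 each
cola_interp     = 28         # 48x48 CoLA interpolation grid

total = qqp_finetuning + cola_finetuning + interpolations + cola_interp
print(total)  # 994, matching the reported ~994 GPU-hours
```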
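
Because the checkpoints live under the Hugging Face organization linked in the Open Source Code row, they can be enumerated and loaded programmatically. The snippet below is a sketch only: the paper does not name individual checkpoints, so the model identifier is picked at runtime rather than hard-coded.

```python
from huggingface_hub import HfApi
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Enumerate checkpoints released under https://huggingface.co/connectivity.
api = HfApi()
model_ids = [m.id for m in api.list_models(author="connectivity")]
print(f"{len(model_ids)} released models, e.g. {model_ids[:3]}")

# Load one of them; the specific choice here is arbitrary, not prescribed by the paper.
model = AutoModelForSequenceClassification.from_pretrained(model_ids[0])
tokenizer = AutoTokenizer.from_pretrained(model_ids[0])
```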
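
The splits quoted in the Dataset Splits row correspond to standard GLUE splits, so one convenient (assumed) way to obtain them is the Hugging Face `datasets` library. PAWS-QQP is distributed separately and must be regenerated from QQP with the original PAWS scripts, so it is omitted here.

```python
from datasets import load_dataset

# Validation splits referenced in the Dataset Splits row.
mnli_matched = load_dataset("glue", "mnli", split="validation_matched")  # ID matched validation set
qqp_val      = load_dataset("glue", "qqp",  split="validation")          # entire QQP validation set
cola_val     = load_dataset("glue", "cola", split="validation")          # CoLA validation set

print(len(mnli_matched), len(qqp_val), len(cola_val))
```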
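
The Experiment Setup row specifies the QQP recipe in enough detail to sketch it with the Hugging Face `Trainer`. This is a minimal approximation, not the authors' code: they used Google's and Hugging Face's reference scripts, and details such as the maximum sequence length and the random seeds are assumptions here.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize QQP question pairs (max_length=128 is an assumption, not stated in the quote).
qqp = load_dataset("glue", "qqp")
qqp = qqp.map(lambda b: tokenizer(b["question1"], b["question2"],
                                  truncation=True, max_length=128), batched=True)

# Hyperparameters quoted in the Experiment Setup row: 3 epochs, lr 2e-5,
# batch size 32, weight decay 0.01, AdamW with linear decay and no warm-up.
args = TrainingArguments(
    output_dir="bert-qqp",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_steps=0,
    seed=0,  # the paper trains 100 such models; vary the seed per run
)

trainer = Trainer(model=model, args=args,
                  train_dataset=qqp["train"],
                  eval_dataset=qqp["validation"],
                  tokenizer=tokenizer)
trainer.train()  # roughly 5 GPU-hours per model according to the figures above
```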