Linear Connectivity Reveals Generalization Strategies
Authors: Jeevesh Juneja, Rachit Bansal, Kyunghyun Cho, João Sedoc, Naomi Saphra
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finetuning 100 QQP models cost around 500 GPU-hours and 48 CoLA models cost around 10 GPU-hours. Each 100×100 interpolation and evaluation consumed 114 GPU-hours; these were performed for both QQP (at three stages during finetuning, for Fig. 7) and MNLI. The 48×48 interpolation for CoLA cost about 28 GPU-hours. The experiments add up to a total cost of approximately 994 GPU-hours on a mix of NVIDIA RTX8000 and V100 nodes. (A cost-breakdown sketch follows the table.) |
| Researcher Affiliation | Academia | Jeevesh Juneja¹, Rachit Bansal¹, Kyunghyun Cho², João Sedoc², Naomi Saphra² (¹Delhi Technological University, ²New York University) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are public. Code: https://github.com/aNOnWhyMooS/connectivity; Models: https://huggingface.co/connectivity |
| Open Datasets | Yes | We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID matched validation set. ... Quora Question Pairs (QQP; Wang et al., 2017) is a common paraphrase corpus... We use the PAWS-QQP (Zhang et al., 2019) diagnostic set... The Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) is a set of acceptable and unacceptable English sentences collected from the linguistics literature. (A dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID matched validation set. ... The dev and test split of PAWS-QQP. The entire QQP and MNLI validation sets. |
| Hardware Specification | Yes | The experiments add up to a total cost of approximately 994 GPU-hours on a mix of NVIDIA RTX8000 and V100 nodes. |
| Software Dependencies | No | The paper mentions using specific BERT models and training scripts from Google and Hugging Face, along with the AdamW optimizer, but does not provide explicit version numbers for software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | The QQP models were trained for 3 epochs, with a learning rate of 2×10⁻⁵, a batch size of 32 samples and a weight decay of 0.01 from the bert-base-uncased pre-trained checkpoint using the google script. ... The CoLA models were trained for 6 epochs with a learning rate of 2×10⁻⁵, a batch size of 32 samples, and no weight decay. This script uses the AdamW (Loshchilov and Hutter, 2017) optimizer too, with a linear learning rate decay schedule but no warm-up. (A hedged fine-tuning sketch follows the table.) |
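
The compute total quoted in the Research Type row can be reconstructed from the per-item figures, provided the 114 GPU-hour interpolation grid is counted four times (three QQP finetuning stages plus MNLI, as the quote suggests). A minimal sketch of that arithmetic, under that reading:

```python
# Reconstruct the ~994 GPU-hour total from the per-item costs quoted above.
# Assumption: the 100x100 interpolation grid (114 GPU-hours each) was run
# four times -- three QQP finetuning stages plus MNLI.
costs = {
    "QQP finetuning (100 models)": 500,
    "CoLA finetuning (48 models)": 10,
    "100x100 interpolations (3 QQP stages + MNLI)": 4 * 114,
    "48x48 CoLA interpolation": 28,
}
for item, hours in costs.items():
    print(f"{item}: {hours} GPU-hours")
print(f"Total: {sum(costs.values())} GPU-hours")  # 994, matching the reported ~994
```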
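
The datasets and evaluation splits listed in the Open Datasets and Dataset Splits rows are all available as GLUE tasks on the Hugging Face Hub; PAWS-QQP is distributed separately by its authors and is omitted here. A minimal loading sketch, assuming the standard `datasets` library identifiers rather than the paper's own data pipeline:

```python
from datasets import load_dataset

# MNLI: the paper inspects losses on the in-distribution *matched* validation set.
mnli_val = load_dataset("glue", "mnli", split="validation_matched")

# QQP: the entire validation set is used for interpolation evaluation.
qqp_val = load_dataset("glue", "qqp", split="validation")

# CoLA: acceptability judgements collected from the linguistics literature.
cola_val = load_dataset("glue", "cola", split="validation")

print(len(mnli_val), len(qqp_val), len(cola_val))

# PAWS-QQP (Zhang et al., 2019) is not a GLUE config; its dev and test splits
# must be obtained from the PAWS release, so they are not loaded here.
```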
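
The Experiment Setup row pins down the main finetuning hyperparameters: bert-base-uncased, AdamW, learning rate 2×10⁻⁵, batch size 32, linear decay with no warm-up, plus 3 epochs with weight decay 0.01 for QQP (6 epochs and no weight decay for CoLA). Below is a hedged sketch of an equivalent configuration using the Hugging Face `Trainer`; the authors used Google's and Hugging Face's original scripts, so preprocessing details and seeds will differ.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize QQP question pairs (GLUE config on the Hugging Face Hub).
qqp = load_dataset("glue", "qqp")
encoded = qqp.map(
    lambda ex: tokenizer(ex["question1"], ex["question2"], truncation=True),
    batched=True,
)

# Hyperparameters from the Experiment Setup row (QQP variant);
# the CoLA models instead use 6 epochs and no weight decay.
args = TrainingArguments(
    output_dir="qqp-bert",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    lr_scheduler_type="linear",  # linear learning rate decay
    warmup_steps=0,              # no warm-up
    optim="adamw_torch",         # AdamW (Loshchilov and Hutter, 2017)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```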