Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Linear Connectivity Reveals Generalization Strategies
Authors: Jeevesh Juneja, Rachit Bansal, Kyunghyun Cho, João Sedoc, Naomi Saphra
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finetuning 100 QQP models cost around 500 GPU-hours and 48 CoLA models cost around 10 GPU-hours. Each 100 × 100 interpolation and evaluation consumed 114 GPU-hours; these were performed for both QQP (at three stages during finetuning, for Fig. 7) and MNLI. The 48 × 48 interpolation for CoLA cost about 28 GPU-hours. The experiments add up to a total cost of approximately 994 GPU-hours on a mix of NVIDIA RTX8000 and V100 nodes. |
| Researcher Affiliation | Academia | Jeevesh Juneja¹, Rachit Bansal¹, Kyunghyun Cho², João Sedoc², Naomi Saphra²; ¹Delhi Technological University, ²New York University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are public. Code: https://github.com/aNOnWhyMooS/connectivity; Models: https://huggingface.co/connectivity |
| Open Datasets | Yes | We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID matched validation set. ... Quora Question Pairs (QQP; Wang et al., 2017) is a common paraphrase corpus... We use the PAWS-QQP (Zhang et al., 2019) diagnostic set... The Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) is a set of acceptable and unacceptable English sentences collected from the linguistics literature. |
| Dataset Splits | Yes | We use the MNLI (Williams et al., 2018) corpus, and inspect losses on the ID matched validation set. ... The dev and test split of PAWS-QQP. The entire QQP and MNLI validation sets. |
| Hardware Specification | Yes | The experiments add up to a total cost of approximately 994 GPU-hours on a mix of NVIDIA RTX8000 and V100 nodes. |
| Software Dependencies | No | The paper mentions using specific BERT models and training scripts from Google and Hugging Face, along with the AdamW optimizer, but does not provide explicit version numbers for software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | The QQP models were trained for 3 epochs, with a learning rate of 2 × 10⁻⁵, a batch size of 32 samples and a weight decay of 0.01 from the bert-base-uncased pre-trained checkpoint using the google script. ... The CoLA models were trained for 6 epochs with a learning rate of 2 × 10⁻⁵, a batch size of 32 samples, and no weight decay. This script uses the AdamW (Loshchilov and Hutter, 2017) optimizer too, with a linear learning rate decay schedule but no warm-up. |
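
The compute figures quoted in the Research Type row can be checked with a short tally. The sketch below is only an illustration: the constant and function names are hypothetical, and the count of four 100 × 100 interpolation runs (three QQP finetuning stages plus one MNLI run) is an assumption inferred from the quoted sentence, not a number the paper states directly.

```python
# GPU-hour figures as quoted in the Research Type row of the table.
FINETUNE_QQP = 500       # finetuning 100 QQP models
FINETUNE_COLA = 10       # finetuning 48 CoLA models
INTERP_100X100 = 114     # one 100 x 100 interpolation and evaluation
INTERP_COLA_48X48 = 28   # the 48 x 48 CoLA interpolation

# Assumption: the 100 x 100 interpolation ran four times in total,
# three QQP finetuning stages plus one MNLI run.
N_100X100_RUNS = 3 + 1


def total_gpu_hours() -> int:
    """Sum the quoted per-stage costs under the run-count assumption above."""
    return (
        FINETUNE_QQP
        + FINETUNE_COLA
        + N_100X100_RUNS * INTERP_100X100
        + INTERP_COLA_48X48
    )


print(total_gpu_hours())  # prints 994
```

Under that assumption the sum matches the reported total of approximately 994 GPU-hours, which suggests the quoted breakdown is complete.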