On Affine Homotopy between Language Encoders
Authors: Robin Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady, Ryan Cotterell
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now explore the practical implications of our theoretical results. We conduct experiments on ELECTRA [6], ROBERTA [28], and the 25 MULTIBERT [35] encoders, which are architecturally identical to BERT-BASE [11] models pre-trained with different seeds. (A hedged encoder-loading sketch follows the table.) |
| Researcher Affiliation | Academia | 1 ETH Zürich, 2 Tsinghua University, 3 Fudan University, 4 Max Planck Institute for Intelligent Systems |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/chanr0/affine-homotopy |
| Open Datasets | Yes | We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14]. |
| Dataset Splits | Yes | We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14]. (A hedged split-loading sketch follows the table.) |
| Hardware Specification | Yes | We compute the embeddings on a single A100-40GB GPU, which took around two hours. All other experiments were run on 8-32 CPU cores, each with 8 GB of memory. |
| Software Dependencies | No | The paper mentions software like "Riemann SGD" and refers to open-source implementations by other authors [12, 33], but it does not specify explicit version numbers for these or other software components (e.g., Python, PyTorch). |
| Experiment Setup | Yes | Each experiment was run using Riemann SGD as an optimizer, as it initially produced the best convergence when computing our affine similarity measures. Further, to account for convergence artifacts, we ran the intrinsic similarity computation optimizations in each experiment for learning rates {1E-4, 1E-3, 1E-2, 1E-1} and extrinsic computations for {1E-3, 1E-2, 2E-2}, and report the best result. When training the task-specific linear probing classifier ψ1 for d̂_ψ1, we use the cross-entropy loss, Riemann SGD, and optimize over the learning rates {1E-2, 1E-1, 2E-1, 4E-1}. For the computation of the Hausdorff Hoare map d_H, we fixed a learning rate of 1E-3 to save compute resources, as this learning rate generally leads to the best convergence in previous experiments. We used a batch size of 64 and let the optimization run for 20 epochs, keeping other parameters at default. For reproducibility, we set the initial seed to 42 during training. (A hedged optimization sketch follows the table.) |
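
All encoders named in the "Research Type" row are publicly distributed on the Hugging Face Hub. The following is a minimal loading sketch, assuming the standard checkpoint identifiers `google/electra-base-discriminator`, `roberta-base`, and `google/multiberts-seed_0` through `google/multiberts-seed_24`; the paper itself does not list checkpoint names, so these are our assumptions.

```python
# Hedged sketch: loading the encoders compared in the paper via Hugging Face
# Transformers. The checkpoint names below are assumptions, not taken from the paper.
from transformers import AutoModel, AutoTokenizer

ENCODER_NAMES = (
    ["google/electra-base-discriminator", "roberta-base"]
    # The 25 MultiBERT encoders: BERT-BASE pre-trained with different seeds.
    + [f"google/multiberts-seed_{i}" for i in range(25)]
)

def load_encoder(name: str):
    """Return (tokenizer, model) for one of the encoders under comparison."""
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name).eval()
```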
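The GLUE training splits cited in the "Open Datasets" and "Dataset Splits" rows can be pulled with the `datasets` library. Below is a sketch of computing sentence embeddings on those splits; mean pooling over the final hidden states is our assumption, since the paper does not specify the pooling strategy.

```python
# Hedged sketch: embedding the SST-2 or MRPC training split with one encoder.
# Mean pooling over final hidden states is an assumption, not the paper's spec.
import torch
from datasets import load_dataset

@torch.no_grad()
def embed_train_split(tokenizer, model, task: str = "sst2", limit: int = 256):
    data = load_dataset("glue", task, split=f"train[:{limit}]")
    if task == "sst2":
        # SST-2 is a single-sentence task.
        batch = tokenizer(data["sentence"], padding=True,
                          truncation=True, return_tensors="pt")
    else:
        # MRPC is a sentence-pair task.
        batch = tokenizer(data["sentence1"], data["sentence2"], padding=True,
                          truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state     # (N, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)  # (N, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled embeddings (N, d)
```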
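The "Experiment Setup" row describes a grid search over learning rates with a Riemannian optimizer, reporting the best result per grid. Below is a minimal sketch of that protocol, assuming `geoopt`'s `RiemannianSGD` (the paper only names "Riemann SGD" and points to open-source implementations without naming a package) and a placeholder least-squares objective standing in for the paper's affine similarity measures.

```python
# Hedged sketch of the sweep-and-report-best protocol quoted above.
# `affine_fit_loss` is a placeholder objective, not the paper's measure.
import torch
from geoopt.optim import RiemannianSGD

def affine_fit_loss(W, b, X, Y):
    """Placeholder: least-squares fit of an affine map X -> Y."""
    return ((X @ W.T + b) - Y).pow(2).mean()

def best_over_lr_grid(X, Y, lrs=(1e-4, 1e-3, 1e-2, 1e-1),
                      epochs=20, batch_size=64, seed=42):
    """Optimize once per learning rate and keep the best final value,
    using batch size 64, 20 epochs, and seed 42 as quoted in the table."""
    best = float("inf")
    for lr in lrs:
        torch.manual_seed(seed)  # initial seed fixed to 42 for reproducibility
        W = torch.nn.Parameter(torch.zeros(Y.shape[1], X.shape[1]))
        b = torch.nn.Parameter(torch.zeros(Y.shape[1]))
        # With plain Euclidean parameters, RiemannianSGD reduces to SGD; the
        # paper's constrained maps would instead use geoopt.ManifoldParameter.
        opt = RiemannianSGD([W, b], lr=lr)
        for _ in range(epochs):
            for i in range(0, X.shape[0], batch_size):
                opt.zero_grad()
                loss = affine_fit_loss(W, b, X[i:i + batch_size],
                                       Y[i:i + batch_size])
                loss.backward()
                opt.step()
        best = min(best, affine_fit_loss(W, b, X, Y).item())
    return best
```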