On Affine Homotopy between Language Encoders

Authors: Robin Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady, Ryan Cotterell

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now explore the practical implications of our theoretical results. We conduct experiments on ELECTRA [6], RoBERTa [28], and the 25 MultiBERT [35] encoders, which are architecturally identical to BERT-base [11] models pre-trained with different seeds.
Researcher Affiliation | Academia | 1 ETH Zürich, 2 Tsinghua University, 3 Fudan University, 4 Max Planck Institute for Intelligent Systems
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/chanr0/affine-homotopy
Open Datasets | Yes | We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14].
Dataset Splits | Yes | We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14].
Hardware Specification | Yes | We compute the embeddings on a single A100-40GB GPU, which took around two hours. All other experiments were run on 8-32 CPU cores, each with 8 GB of memory.
Software Dependencies | No | The paper mentions software like "Riemann SGD" and refers to open-source implementations by other authors [12, 33], but it does not specify explicit version numbers for these or other software components (e.g., Python, PyTorch).
Experiment Setup | Yes | Each experiment was run using Riemann SGD as an optimizer, as it initially produced the best convergence when computing our affine similarity measures. Further, to account for convergence artifacts, we ran the intrinsic similarity computation optimizations in each experiment for learning rates {1E-4, 1E-3, 1E-2, 1E-1} and the extrinsic computations for {1E-3, 1E-2, 2E-2}, and report the best result. When training the task-specific linear probing classifier ψ1 for d̂_ψ1, we use the cross-entropy loss, Riemann SGD, and optimize over the learning rates {1E-2, 1E-1, 2E-1, 4E-1}. For the computation of the Hausdorff Hoare map d_H, we fixed a learning rate of 1E-3 to save compute resources, as this learning rate generally leads to the best convergence in previous experiments. We used a batch size of 64 and let optimization run for 20 epochs, keeping other parameters at default. For reproducibility, we set the initial seed to 42 during training.
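
The Research Type, Open Datasets, and Hardware Specification rows together describe the embedding-extraction step: representations from ELECTRA, RoBERTa, and the MultiBERT encoders are computed for the SST-2 and MRPC training sets on a single A100-40GB GPU. The sketch below shows one plausible way to reproduce that step with the Hugging Face transformers and datasets libraries; the checkpoint names, the first-token pooling, and the embed helper are illustrative assumptions, not details taken from the paper or its repository.

```python
# Illustrative sketch (not the authors' code): extract encoder embeddings
# for the GLUE training sets used in the paper. Checkpoint names and the
# first-token pooling are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

ENCODERS = [
    "google/electra-base-discriminator",  # ELECTRA
    "roberta-base",                       # RoBERTa
    "google/multiberts-seed_0",           # one of the 25 MultiBERT seeds
]

def embed(model_name: str, sentences: list[str], device: str = "cuda") -> torch.Tensor:
    """Return one embedding per sentence (here: the first-token hidden state)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0].cpu()

# Training sets of the two GLUE classification tasks named in the paper.
sst2_train = load_dataset("glue", "sst2", split="train")
mrpc_train = load_dataset("glue", "mrpc", split="train")

# Example: embeddings of a small SST-2 slice for every encoder.
embeddings = {name: embed(name, sst2_train["sentence"][:64]) for name in ENCODERS}
```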
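
The Experiment Setup row reports sweeping a grid of learning rates with Riemann SGD when computing the affine similarity measures. Below is a minimal sketch of that sweep, assuming the geoopt implementation of Riemannian SGD (the paper only points to open-source implementations [12, 33]); the least-squares objective and the Stiefel (orthogonality) constraint on the linear map are placeholders, not the paper's exact intrinsic-similarity objective.

```python
# Minimal sketch, assuming geoopt's RiemannianSGD. The objective and the
# manifold constraint are placeholders, not the paper's exact formulation.
import torch
import geoopt

torch.manual_seed(42)                      # reported initial seed
LEARNING_RATES = [1e-4, 1e-3, 1e-2, 1e-1]  # reported intrinsic-similarity grid
EPOCHS, BATCH_SIZE = 20, 64                # reported epochs and batch size

def fit_affine_map(source: torch.Tensor, target: torch.Tensor, lr: float) -> float:
    """Fit an affine map x -> A x + b from one encoder's embeddings to another's."""
    d = source.shape[1]
    A = geoopt.ManifoldParameter(torch.eye(d), manifold=geoopt.Stiefel())
    b = torch.zeros(d, requires_grad=True)
    optimizer = geoopt.optim.RiemannianSGD([A, b], lr=lr)
    loss = torch.tensor(float("inf"))
    for _ in range(EPOCHS):
        perm = torch.randperm(len(source))
        for i in range(0, len(source), BATCH_SIZE):
            idx = perm[i:i + BATCH_SIZE]
            loss = ((source[idx] @ A.T + b - target[idx]) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return loss.item()

# As reported, run the sweep over the learning-rate grid and keep the best result.
# best = min(fit_affine_map(X, Y, lr) for lr in LEARNING_RATES)
```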
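
The same row also describes training a task-specific linear probing classifier ψ1 with the cross-entropy loss over the learning-rate grid {1E-2, 1E-1, 2E-1, 4E-1}. The hedged sketch below uses plain torch.optim.SGD and a simplified selection criterion, whereas the paper reports using Riemann SGD and keeping the best-converging run.

```python
# Hypothetical sketch of the linear probe psi_1; the optimizer and the
# model-selection criterion are simplifications of what the paper reports.
import torch
import torch.nn as nn

def train_probe(embeddings: torch.Tensor, labels: torch.Tensor, lr: float,
                epochs: int = 20, batch_size: int = 64) -> tuple[nn.Linear, float]:
    """Train a linear classifier on frozen embeddings; return it with its final loss."""
    probe = nn.Linear(embeddings.shape[1], int(labels.max()) + 1)
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loss = torch.tensor(float("inf"))
    for _ in range(epochs):
        perm = torch.randperm(len(embeddings))
        for i in range(0, len(embeddings), batch_size):
            idx = perm[i:i + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(probe(embeddings[idx]), labels[idx])
            loss.backward()
            optimizer.step()
    return probe, loss.item()

# Sweep the reported probe learning rates and keep the run with the lowest final loss.
# probes = [train_probe(X, y, lr) for lr in (1e-2, 1e-1, 2e-1, 4e-1)]
# best_probe, _ = min(probes, key=lambda pair: pair[1])
```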