On Affine Homotopy between Language Encoders

Authors: Robin Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady, Ryan Cotterell

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now explore the practical implications of our theoretical results. We conduct experiments on ELECTRA [6], RoBERTa [28], and the 25 MultiBERT [35] encoders, which are architecturally identical to BERT-base [11] models pre-trained with different seeds.
Researcher Affiliation | Academia | 1 ETH Zürich, 2 Tsinghua University, 3 Fudan University, 4 Max Planck Institute for Intelligent Systems
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/chanr0/affine-homotopy
Open Datasets | Yes | We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14].
Dataset Splits | Yes | We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14].
Hardware Specification | Yes | We compute the embeddings on a single A100-40GB GPU, which took around two hours. All other experiments were run on 8-32 CPU cores, each with 8 GB of memory.
Software Dependencies | No | The paper mentions software like "Riemann SGD" and refers to open-source implementations by other authors [12, 33], but it does not specify explicit version numbers for these or other software components (e.g., Python, PyTorch).
Experiment Setup | Yes | Each experiment was run using Riemann SGD as an optimizer, as it initially produced the best convergence when computing our affine similarity measures. Further, to account for convergence artifacts, we ran the intrinsic similarity computation optimizations in each experiment for learning rates {1E-4, 1E-3, 1E-2, 1E-1} and the extrinsic computations for {1E-3, 1E-2, 2E-2}, and report the best result. When training the task-specific linear probing classifier ψ1 for d̂_ψ1, we use the cross-entropy loss, Riemann SGD, and optimize over the learning rates {1E-2, 1E-1, 2E-1, 4E-1}. For the computation of the Hausdorff Hoare map d_H, we fixed a learning rate of 1E-3 to save compute resources, as this learning rate generally leads to the best convergence in previous experiments. We used a batch size of 64 and let optimization run for 20 epochs, keeping other parameters at default. For reproducibility, we set the initial seed to 42 during training.
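
The Research Type, Open Datasets, and Hardware Specification rows together describe the embedding-extraction step: representations from ELECTRA, RoBERTa, and the MultiBERT encoders are computed for the SST-2 and MRPC training sets on a single A100-40GB GPU. The sketch below shows one plausible way to reproduce that step with the Hugging Face transformers and datasets libraries; the checkpoint names, the first-token pooling, and the embed helper are illustrative assumptions, not details taken from the paper or its repository.

```python
# Illustrative sketch (not the authors' code): extract encoder embeddings
# for the GLUE training sets used in the paper. Checkpoint names and the
# first-token pooling are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

ENCODERS = [
    "google/electra-base-discriminator",  # ELECTRA
    "roberta-base",                       # RoBERTa
    "google/multiberts-seed_0",           # one of the 25 MultiBERT seeds
]

def embed(model_name: str, sentences: list[str], device: str = "cuda") -> torch.Tensor:
    """Return one embedding per sentence (here: the first-token hidden state)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0].cpu()

# Training sets of the two GLUE classification tasks named in the paper.
sst2_train = load_dataset("glue", "sst2", split="train")
mrpc_train = load_dataset("glue", "mrpc", split="train")

# Example: embeddings of a small SST-2 slice for every encoder.
embeddings = {name: embed(name, sst2_train["sentence"][:64]) for name in ENCODERS}
```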
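
The Experiment Setup row reports sweeping a grid of learning rates with Riemann SGD when computing the affine similarity measures. Below is a minimal sketch of that sweep, assuming the geoopt implementation of Riemannian SGD (the paper only points to open-source implementations [12, 33]); the least-squares objective and the Stiefel (orthogonality) constraint on the linear map are placeholders, not the paper's exact intrinsic-similarity objective.

```python
# Minimal sketch, assuming geoopt's RiemannianSGD. The objective and the
# manifold constraint are placeholders, not the paper's exact formulation.
import torch
import geoopt

torch.manual_seed(42)                      # reported initial seed
LEARNING_RATES = [1e-4, 1e-3, 1e-2, 1e-1]  # reported intrinsic-similarity grid
EPOCHS, BATCH_SIZE = 20, 64                # reported epochs and batch size

def fit_affine_map(source: torch.Tensor, target: torch.Tensor, lr: float) -> float:
    """Fit an affine map x -> A x + b from one encoder's embeddings to another's."""
    d = source.shape[1]
    A = geoopt.ManifoldParameter(torch.eye(d), manifold=geoopt.Stiefel())
    b = torch.zeros(d, requires_grad=True)
    optimizer = geoopt.optim.RiemannianSGD([A, b], lr=lr)
    loss = torch.tensor(float("inf"))
    for _ in range(EPOCHS):
        perm = torch.randperm(len(source))
        for i in range(0, len(source), BATCH_SIZE):
            idx = perm[i:i + BATCH_SIZE]
            loss = ((source[idx] @ A.T + b - target[idx]) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return loss.item()

# As reported, run the sweep over the learning-rate grid and keep the best result.
# best = min(fit_affine_map(X, Y, lr) for lr in LEARNING_RATES)
```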
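
The same row also describes training a task-specific linear probing classifier ψ1 with the cross-entropy loss over the learning-rate grid {1E-2, 1E-1, 2E-1, 4E-1}. The hedged sketch below uses plain torch.optim.SGD and a simplified selection criterion, whereas the paper reports using Riemann SGD and keeping the best-converging run.

```python
# Hypothetical sketch of the linear probe psi_1; the optimizer and the
# model-selection criterion are simplifications of what the paper reports.
import torch
import torch.nn as nn

def train_probe(embeddings: torch.Tensor, labels: torch.Tensor, lr: float,
                epochs: int = 20, batch_size: int = 64) -> tuple[nn.Linear, float]:
    """Train a linear classifier on frozen embeddings; return it with its final loss."""
    probe = nn.Linear(embeddings.shape[1], int(labels.max()) + 1)
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loss = torch.tensor(float("inf"))
    for _ in range(epochs):
        perm = torch.randperm(len(embeddings))
        for i in range(0, len(embeddings), batch_size):
            idx = perm[i:i + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(probe(embeddings[idx]), labels[idx])
            loss.backward()
            optimizer.step()
    return probe, loss.item()

# Sweep the reported probe learning rates and keep the run with the lowest final loss.
# probes = [train_probe(X, y, lr) for lr in (1e-2, 1e-1, 2e-1, 4e-1)]
# best_probe, _ = min(probes, key=lambda pair: pair[1])
```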