Abrupt Learning in Transformers: A Case Study on Matrix Completion

Authors: Pulkit Gopalani, Ekdeep S Lubana, Wei Hu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train a BERT model TF_θ to predict missing entries in a low-rank masked matrix X. In our training setup, the model converges to a final MSE of approximately 4e-3; that is, it can solve matrix completion well (as in Fig. 3, this MSE is lower than that of nuclear norm minimization). (A hedged sketch of a nuclear-norm-minimization baseline appears after the table.)
Researcher Affiliation | Academia | Pulkit Gopalani (University of Michigan, gopalani@umich.edu); Ekdeep Singh Lubana (Harvard University, ekdeeplubana@fas.harvard.edu); Wei Hu (University of Michigan, vvh@umich.edu)
Pseudocode | No | The paper describes methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/pulkitgopalani/tf-matcomp.
Open Datasets | No | Data for matrix completion is generated as X = UV⊤, with U, V ∈ R^{n×r} and entries U_ij, V_ij iid ~ Unif[−1, 1] for all (i, j) ∈ [n] × [r], so that X has rank at most r. To mask entries at random, we sample binary matrices M ∈ {0, 1}^{n×n} such that M_ij = 0 with probability p_mask, and 1 otherwise; that is, Ω = {(i, j) | M_ij = 1}. (A sketch of this sampling procedure appears after the table.)
Dataset Splits | No | Since data is not partitioned into fixed train and test sets, we only analyze the training loss in all cases.
Hardware Specification | Yes | For 7×7 matrices (training and testing), we used a single {V100 / A100 / L40S} GPU. A single {A40 / A100 / L40S} GPU was used for matrices of order 10, 12, 15.
Software Dependencies | No | The paper mentions software like BERT and CVXPY, and the Hugging Face transformers library, but does not provide specific version numbers for these dependencies (e.g., 'CVXPY 1.x' or 'transformers 4.x').
Experiment Setup | Yes | We use a 4-layer, 8-head BERT model [40] for 7×7 (rank-2) matrices, with absolute positional embeddings, no token type embeddings, and no dropout. We fix p_mask = 0.3 for training, and 256 matrices are sampled as training data at each step (in an online training setup). We use the Adam optimizer with constant step size 1e-4 for 50000 steps, without weight decay or warmup. (A hedged sketch of this configuration appears after the table.)
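
To make the data-generation step quoted in the Open Datasets row concrete, here is a minimal NumPy sketch. The function name and RNG handling are ours; the sampling itself follows the quoted description (entries of U and V drawn iid from Unif[-1, 1], each entry masked independently with probability p_mask).

```python
import numpy as np

def sample_masked_matrix(n=7, r=2, p_mask=0.3, rng=None):
    """Sample a rank-<=r matrix X = U V^T and an observation mask M (1 = observed)."""
    rng = rng if rng is not None else np.random.default_rng()
    # U, V in R^{n x r} with entries iid Unif[-1, 1], so X = U V^T has rank at most r.
    U = rng.uniform(-1.0, 1.0, size=(n, r))
    V = rng.uniform(-1.0, 1.0, size=(n, r))
    X = U @ V.T
    # M_ij = 0 with probability p_mask (masked), 1 otherwise; Omega = {(i, j) : M_ij = 1}.
    M = (rng.uniform(size=(n, n)) >= p_mask).astype(float)
    return X, M
```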
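The nuclear-norm-minimization baseline referenced in the Research Type row can be computed with CVXPY, which the paper mentions. The formulation below (minimize the nuclear norm subject to agreement on the observed entries) is the standard convex relaxation; it is an assumption about the exact setup the authors used, and it reuses the sampler sketched above.

```python
import numpy as np
import cvxpy as cp

def nuclear_norm_complete(X, M):
    """Minimize ||Z||_* subject to Z agreeing with X on the observed entries (M == 1)."""
    Z = cp.Variable(X.shape)
    constraints = [cp.multiply(M, Z) == cp.multiply(M, X)]
    cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints).solve()
    return Z.value

# Example: MSE on the masked entries, the quantity the transformer's final MSE is compared against.
X, M = sample_masked_matrix()
Z_hat = nuclear_norm_complete(X, M)
masked = (M == 0)
mse_masked = np.mean((Z_hat[masked] - X[masked]) ** 2)
print(f"nuclear-norm baseline MSE on masked entries: {mse_masked:.2e}")
```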
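For the Experiment Setup row, the sketch below instantiates the stated configuration with Hugging Face transformers and PyTorch: a 4-layer, 8-head BERT encoder with absolute positional embeddings and dropout disabled, trained online with Adam at a constant 1e-4 step size on 256 freshly sampled 7×7 rank-2 matrices per step for 50000 steps. The hidden size, the linear embedding/readout layers, the zeroing of masked inputs, and the masked-entry MSE loss are assumptions added to make the example runnable; they are not specified in the quoted text.

```python
import torch
from transformers import BertConfig, BertModel

# 4-layer, 8-head BERT encoder, absolute positional embeddings, no dropout;
# type_vocab_size=1 makes token type embeddings effectively inert.
# hidden_size / intermediate_size are assumed values, not stated in the quote.
config = BertConfig(
    num_hidden_layers=4,
    num_attention_heads=8,
    hidden_size=256,
    intermediate_size=1024,
    max_position_embeddings=64,           # >= 49 tokens for a flattened 7x7 matrix
    position_embedding_type="absolute",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    type_vocab_size=1,
)
encoder = BertModel(config, add_pooling_layer=False)

# Assumed input/output layers: embed each (possibly masked) scalar entry as a token
# and regress one scalar per token from the encoder output.
embed = torch.nn.Linear(1, config.hidden_size)
head = torch.nn.Linear(config.hidden_size, 1)

params = list(encoder.parameters()) + list(embed.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)   # constant step size, no weight decay, no warmup

n, r, p_mask, batch = 7, 2, 0.3, 256
for step in range(50_000):                # online setup: fresh matrices at every step
    U = torch.rand(batch, n, r) * 2 - 1
    V = torch.rand(batch, n, r) * 2 - 1
    X = U @ V.transpose(1, 2)             # rank <= r targets
    M = (torch.rand(batch, n, n) >= p_mask).float()
    tokens = (X * M).reshape(batch, n * n, 1)     # masked entries zeroed (an assumption)
    hidden = encoder(inputs_embeds=embed(tokens)).last_hidden_state
    pred = head(hidden).reshape(batch, n, n)
    # MSE on the masked entries only (assumed loss definition).
    loss = (((pred - X) * (1 - M)) ** 2).sum() / (1 - M).sum().clamp(min=1)
    opt.zero_grad()
    loss.backward()
    opt.step()
```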