Abrupt Learning in Transformers: A Case Study on Matrix Completion

Authors: Pulkit Gopalani, Ekdeep S Lubana, Wei Hu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train a BERT model TF_θ to predict missing entries in a low-rank masked matrix X. In our training setup, the model converges to a final MSE of approximately 4e-3; that is, it can solve matrix completion well (as in Fig. 3, this MSE is lower than that of nuclear norm minimization). (A hedged sketch of a nuclear-norm-minimization baseline appears after the table.)
Researcher Affiliation | Academia | Pulkit Gopalani (University of Michigan, gopalani@umich.edu); Ekdeep Singh Lubana (Harvard University, ekdeeplubana@fas.harvard.edu); Wei Hu (University of Michigan, vvh@umich.edu)
Pseudocode | No | The paper describes methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/pulkitgopalani/tf-matcomp.
Open Datasets | No | Data for matrix completion is generated as X = UV⊤, with U, V ∈ R^{n×r} and entries U_ij, V_ij iid ~ Unif[−1, 1] for all (i, j) ∈ [n] × [r], so that X has rank at most r. To mask entries at random, we sample binary matrices M ∈ {0, 1}^{n×n} such that M_ij = 0 with probability p_mask, and 1 otherwise; that is, Ω = {(i, j) | M_ij = 1}. (A sketch of this sampling procedure appears after the table.)
Dataset Splits | No | Since data is not partitioned into fixed train and test sets, we only analyze the training loss in all cases.
Hardware Specification | Yes | For 7×7 matrices (training and testing), we used a single {V100 / A100 / L40S} GPU. A single {A40 / A100 / L40S} GPU was used for matrices of order 10, 12, 15.
Software Dependencies | No | The paper mentions software like BERT and CVXPY, and the Hugging Face transformers library, but does not provide specific version numbers for these dependencies (e.g., 'CVXPY 1.x' or 'transformers 4.x').
Experiment Setup | Yes | We use a 4-layer, 8-head BERT model [40] for 7×7 (rank-2) matrices, with absolute positional embeddings, no token type embeddings, and no dropout. We fix p_mask = 0.3 for training, and 256 matrices are sampled as training data at each step (in an online training setup). We use the Adam optimizer with constant step size 1e-4 for 50000 steps, without weight decay or warmup. (A hedged sketch of this configuration appears after the table.)
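
To make the data-generation step quoted in the Open Datasets row concrete, here is a minimal NumPy sketch. The function name and RNG handling are ours; the sampling itself follows the quoted description (entries of U and V drawn iid from Unif[-1, 1], each entry masked independently with probability p_mask).

```python
import numpy as np

def sample_masked_matrix(n=7, r=2, p_mask=0.3, rng=None):
    """Sample a rank-<=r matrix X = U V^T and an observation mask M (1 = observed)."""
    rng = rng if rng is not None else np.random.default_rng()
    # U, V in R^{n x r} with entries iid Unif[-1, 1], so X = U V^T has rank at most r.
    U = rng.uniform(-1.0, 1.0, size=(n, r))
    V = rng.uniform(-1.0, 1.0, size=(n, r))
    X = U @ V.T
    # M_ij = 0 with probability p_mask (masked), 1 otherwise; Omega = {(i, j) : M_ij = 1}.
    M = (rng.uniform(size=(n, n)) >= p_mask).astype(float)
    return X, M
```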
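The nuclear-norm-minimization baseline referenced in the Research Type row can be computed with CVXPY, which the paper mentions. The formulation below (minimize the nuclear norm subject to agreement on the observed entries) is the standard convex relaxation; it is an assumption about the exact setup the authors used, and it reuses the sampler sketched above.

```python
import numpy as np
import cvxpy as cp

def nuclear_norm_complete(X, M):
    """Minimize ||Z||_* subject to Z agreeing with X on the observed entries (M == 1)."""
    Z = cp.Variable(X.shape)
    constraints = [cp.multiply(M, Z) == cp.multiply(M, X)]
    cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints).solve()
    return Z.value

# Example: MSE on the masked entries, the quantity the transformer's final MSE is compared against.
X, M = sample_masked_matrix()
Z_hat = nuclear_norm_complete(X, M)
masked = (M == 0)
mse_masked = np.mean((Z_hat[masked] - X[masked]) ** 2)
print(f"nuclear-norm baseline MSE on masked entries: {mse_masked:.2e}")
```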
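For the Experiment Setup row, the sketch below instantiates the stated configuration with Hugging Face transformers and PyTorch: a 4-layer, 8-head BERT encoder with absolute positional embeddings and dropout disabled, trained online with Adam at a constant 1e-4 step size on 256 freshly sampled 7×7 rank-2 matrices per step for 50000 steps. The hidden size, the linear embedding/readout layers, the zeroing of masked inputs, and the masked-entry MSE loss are assumptions added to make the example runnable; they are not specified in the quoted text.

```python
import torch
from transformers import BertConfig, BertModel

# 4-layer, 8-head BERT encoder, absolute positional embeddings, no dropout;
# type_vocab_size=1 makes token type embeddings effectively inert.
# hidden_size / intermediate_size are assumed values, not stated in the quote.
config = BertConfig(
    num_hidden_layers=4,
    num_attention_heads=8,
    hidden_size=256,
    intermediate_size=1024,
    max_position_embeddings=64,           # >= 49 tokens for a flattened 7x7 matrix
    position_embedding_type="absolute",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    type_vocab_size=1,
)
encoder = BertModel(config, add_pooling_layer=False)

# Assumed input/output layers: embed each (possibly masked) scalar entry as a token
# and regress one scalar per token from the encoder output.
embed = torch.nn.Linear(1, config.hidden_size)
head = torch.nn.Linear(config.hidden_size, 1)

params = list(encoder.parameters()) + list(embed.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)   # constant step size, no weight decay, no warmup

n, r, p_mask, batch = 7, 2, 0.3, 256
for step in range(50_000):                # online setup: fresh matrices at every step
    U = torch.rand(batch, n, r) * 2 - 1
    V = torch.rand(batch, n, r) * 2 - 1
    X = U @ V.transpose(1, 2)             # rank <= r targets
    M = (torch.rand(batch, n, n) >= p_mask).float()
    tokens = (X * M).reshape(batch, n * n, 1)     # masked entries zeroed (an assumption)
    hidden = encoder(inputs_embeds=embed(tokens)).last_hidden_state
    pred = head(hidden).reshape(batch, n, n)
    # MSE on the masked entries only (assumed loss definition).
    loss = (((pred - X) * (1 - M)) ** 2).sum() / (1 - M).sum().clamp(min=1)
    opt.zero_grad()
    loss.backward()
    opt.step()
```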