Abrupt Learning in Transformers: A Case Study on Matrix Completion
Authors: Pulkit Gopalani, Ekdeep S Lubana, Wei Hu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train a BERT model $\mathrm{TF}_\theta$ to predict missing entries in a low-rank masked matrix $X$. In our training setup, the model converges to a final MSE of approximately $4 \times 10^{-3}$; that is, it can solve matrix completion well (as in Fig. 3, this MSE is lower than that of nuclear norm minimization). (See the nuclear-norm baseline sketch after the table.) |
| Researcher Affiliation | Academia | Pulkit Gopalani (University of Michigan, gopalani@umich.edu); Ekdeep Singh Lubana (Harvard University, ekdeeplubana@fas.harvard.edu); Wei Hu (University of Michigan, vvh@umich.edu) |
| Pseudocode | No | The paper describes methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/pulkitgopalani/tf-matcomp. |
| Open Datasets | No | Data for matrix completion is generated as $X = UV^\top$, where $U, V \in \mathbb{R}^{n \times r}$ and $U_{ij}, V_{ij} \stackrel{\text{iid}}{\sim} \mathrm{Unif}[-1, 1]$ for all $(i, j) \in [n] \times [r]$, so that $X$ has rank at most $r$. To mask entries at random, we sample binary matrices $M \in \{0, 1\}^{n \times n}$ such that $M_{ij} = 0$ with probability $p_{\text{mask}}$, and 1 otherwise; that is, $\Omega = \{(i, j) \mid M_{ij} = 1\}$. (See the data-generation sketch after the table.) |
| Dataset Splits | No | Since data is not partitioned into fixed train and test sets, we only analyze the training loss in all cases. |
| Hardware Specification | Yes | For 7×7 matrices (training and testing), we used a single V100 / A100 / L40S GPU. A single A40 / A100 / L40S GPU was used for matrices of order 10, 12, 15. |
| Software Dependencies | No | The paper mentions software such as BERT and CVXPY, and Hugging Face's transformers library, but does not provide specific version numbers for these dependencies (e.g., 'CVXPY 1.x' or 'transformers 4.x'). |
| Experiment Setup | Yes | We use a 4-layer, 8-head BERT model [40] for 7×7 (rank-2) matrices, with absolute positional embeddings, no token type embeddings, and no dropout. We fix $p_{\text{mask}} = 0.3$ for training, and 256 matrices are sampled as training data at each step (in an online training setup). We use the Adam optimizer with constant step size 1e-4 for 50,000 steps, without weight decay or warmup. (See the training-loop sketch after the table.) |
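
The data-generation procedure quoted in the Open Datasets row is simple enough to sketch directly. Below is a minimal PyTorch reconstruction of it; the function name `sample_batch` and the batching convention are our own, not taken from the paper's code.

```python
import torch

def sample_batch(n=7, r=2, p_mask=0.3, batch_size=256):
    """Sample X = U V^T with U, V having i.i.d. Unif[-1, 1] entries
    (so rank(X) <= r), plus binary masks M with M_ij = 0 w.p. p_mask."""
    U = torch.empty(batch_size, n, r).uniform_(-1.0, 1.0)
    V = torch.empty(batch_size, n, r).uniform_(-1.0, 1.0)
    X = U @ V.transpose(-1, -2)                            # low-rank targets
    M = (torch.rand(batch_size, n, n) >= p_mask).float()   # 1 = observed entry
    return X, M
```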
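The nuclear-norm-minimization baseline referenced in the Research Type row is a standard convex program, sketched here with CVXPY (one of the paper's stated dependencies). The function name and the default-solver choice are assumptions, not details from the paper.

```python
import cvxpy as cp
import numpy as np

def nuclear_norm_complete(X_obs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Recover a low-rank matrix by minimizing the nuclear norm ||Z||_*
    subject to matching X_obs on the observed entries (mask == 1)."""
    Z = cp.Variable(X_obs.shape)
    constraints = [cp.multiply(mask, Z) == cp.multiply(mask, X_obs)]
    cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints).solve()
    return Z.value
```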
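Finally, the Experiment Setup row pins down most of the training configuration (4 layers, 8 heads, no dropout, Adam at 1e-4, 256 fresh matrices per step for 50,000 steps). The training-loop sketch below follows those settings and reuses `sample_batch` from above; the hidden width, the scalar embedding/readout layers, and the masked-entry MSE loss are our assumptions, since the paper's exact input encoding lives in its released code.

```python
import torch
from transformers import BertConfig, BertModel

n, r, p_mask, batch_size = 7, 2, 0.3, 256

# 4-layer, 8-head BERT, absolute positional embeddings, no dropout,
# no token type embeddings (per the paper); hidden_size=64 is assumed.
config = BertConfig(
    num_hidden_layers=4,
    num_attention_heads=8,
    hidden_size=64,
    intermediate_size=256,
    position_embedding_type="absolute",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    type_vocab_size=1,
    max_position_embeddings=n * n,
)
model = BertModel(config)
embed = torch.nn.Linear(1, config.hidden_size)    # scalar entry -> token embedding
readout = torch.nn.Linear(config.hidden_size, 1)  # token embedding -> predicted entry

params = list(model.parameters()) + list(embed.parameters()) + list(readout.parameters())
opt = torch.optim.Adam(params, lr=1e-4)           # constant step size, no weight decay

for step in range(50_000):                        # online setup: fresh data every step
    X, M = sample_batch(n=n, r=r, p_mask=p_mask, batch_size=batch_size)
    tokens = (X * M).reshape(batch_size, n * n, 1)            # masked entries zeroed
    h = model(inputs_embeds=embed(tokens)).last_hidden_state
    pred = readout(h).reshape(batch_size, n, n)
    loss = ((pred - X) ** 2)[M == 0].mean()       # MSE on masked entries (assumed)
    opt.zero_grad()
    loss.backward()
    opt.step()
```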