Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
Authors: Liwei Wu, Shuqing Li, Cho-Jui Hsieh, James L. Sharpnack
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper states: 'We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems, to the transformer and BERT in natural languages.' It further reports that 'when used along with widely-used regularization methods such as weight decay and dropout, our proposed methods can further reduce over-fitting, which often leads to more favorable generalization results.' |
| Researcher Affiliation | Academia | Liwei Wu, Department of Statistics, University of California, Davis, CA 95616 (liwu@ucdavis.edu); Shuqing Li, Department of Computer Science, University of California, Davis, CA 95616 (qshli@ucdavis.edu); Cho-Jui Hsieh, Department of Computer Science, University of California, Los Angeles, CA 90095 (chohsieh@cs.ucla.edu); James Sharpnack, Department of Statistics, University of California, Davis, CA 95616 (jsharpna@ucdavis.edu) |
| Pseudocode | Yes | Algorithm 1 'SSE-Graph for Neural Networks with Embeddings' (an illustrative sketch of the procedure appears below the table). |
| Open Source Code | No | The paper mentions 'We use the Open NMT implementation in our experiments.' but does not provide a statement or link for the source code of its own proposed method. |
| Open Datasets | Yes | The paper uses well-known public datasets such as 'Movielens1m', 'Movielens10m', 'Netflix', 'WMT 2014 English to German dataset', 'IMDB movie reviews', and 'SST-2 sentiment classification task'. These are standard academic datasets commonly used in the research community. |
| Dataset Splits | Yes | The paper refers to 'test sets' and a 'dev set' (e.g., 'SST-2 Dev Set' and 'SST-2 Test Set'), implying the use of standard, predefined splits for the benchmark datasets. It also states 'Note that the details about datasets and parameter settings can be found in the appendix.' |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory amounts) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions 'We use the Open NMT implementation in our experiments.' but does not provide a specific version number for Open NMT or any other software dependencies. |
| Experiment Setup | Yes | The paper provides specific hyperparameters: 'We use the dropout probability of 0.1, weight decay of 1e-5, and learning rate of 1e-3 for all experiments.' (Section 4.2); 'We use the same dropout rate of 0.1 and label smoothing value of 0.1 for the baseline model and our SSE-enhanced model. The only difference between the two models is whether or not we use our proposed SSE-SE with p0 = 0.01 in (5) for both encoder and decoder embedding layers.' (Section 4.3); and 'We use SSE probability of 0.015 for embeddings (onehot encodings) associated with labels and SSE probability of 0.015 for embeddings (word-piece embeddings) associated with inputs.' (Section 4.4). A sketch of an SSE-SE embedding layer with these probabilities appears below the table. |
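
For orientation, the paper's Algorithm 1 (SSE-Graph) stochastically replaces each embedding lookup index with a neighboring index from a knowledge graph over the indices before the lookup is performed, so that connected entities occasionally share gradient updates. The following is a minimal sketch of that idea, assuming PyTorch; the class name `SSEGraphEmbedding`, the `neighbors` dict representation, and the transition probability `p` are illustrative choices of ours, not taken from the paper.

```python
import torch
import torch.nn as nn

class SSEGraphEmbedding(nn.Module):
    """Illustrative sketch of SSE-Graph: during training, each lookup
    index transitions with probability p to a random neighbor in a
    user-supplied knowledge graph over embedding indices."""

    def __init__(self, num_embeddings, embedding_dim, neighbors, p=0.01):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings, embedding_dim)
        # neighbors[i] lists the indices connected to i in the graph
        # (a hypothetical representation; the paper only assumes a graph).
        self.neighbors = neighbors
        self.p = p

    def forward(self, indices):
        if self.training and self.p > 0:
            flat = indices.clone().reshape(-1)
            for pos in range(flat.numel()):
                hops = self.neighbors.get(int(flat[pos]), [])
                if hops and torch.rand(()).item() < self.p:
                    # Jump to a uniformly sampled neighbor before lookup.
                    flat[pos] = hops[int(torch.randint(len(hops), ()))]
            indices = flat.view(indices.shape)
        return self.embed(indices)
```

SSE-SE is then the special case in which the graph connects every pair of indices, which is why the paper can describe it with the single probability p0 quoted in the Experiment Setup row; a vectorized sketch of that case follows.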
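The SSE-SE settings quoted above (p0 = 0.01 for the transformer, 0.015 for BERT) correspond to swapping each index for a uniformly random one. A minimal vectorized sketch under the same PyTorch assumption; again, the class and argument names are ours:

```python
import torch
import torch.nn as nn

class SSESEEmbedding(nn.Module):
    """Illustrative sketch of SSE-SE: with probability p0, each lookup
    index is swapped for a uniformly random index during training."""

    def __init__(self, num_embeddings, embedding_dim, p0=0.01):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings, embedding_dim)
        self.p0 = p0

    def forward(self, indices):
        if self.training and self.p0 > 0:
            swap = torch.rand(indices.shape, device=indices.device) < self.p0
            rand_idx = torch.randint_like(indices, self.embed.num_embeddings)
            indices = torch.where(swap, rand_idx, indices)
        return self.embed(indices)

# Mirroring the reported transformer setting (sizes here are illustrative):
# layer = SSESEEmbedding(num_embeddings=32000, embedding_dim=512, p0=0.01)
```

At evaluation time both sketches reduce to a plain embedding lookup, consistent with SSE being a train-time regularizer.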