Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Authors: Liwei Wu, Shuqing Li, Cho-Jui Hsieh, James L. Sharpnack

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The abstract states: 'We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems, to the transformer and BERT in natural languages,' and reports that 'when used along with widely-used regularization methods such as weight decay and dropout, our proposed methods can further reduce over-fitting, which often leads to more favorable generalization results.'
Researcher Affiliation | Academia | Liwei Wu, Department of Statistics, University of California, Davis, CA 95616 (liwu@ucdavis.edu); Shuqing Li, Department of Computer Science, University of California, Davis, CA 95616 (qshli@ucdavis.edu); Cho-Jui Hsieh, Department of Computer Science, University of California, Los Angeles, CA 90095 (chohsieh@cs.ucla.edu); James Sharpnack, Department of Statistics, University of California, Davis, CA 95616 (jsharpna@ucdavis.edu)
Pseudocode | Yes | The paper presents Algorithm 1, 'SSE-Graph for Neural Networks with Embeddings' (a minimal sketch of the underlying idea appears after this table).
Open Source Code | No | The paper mentions 'We use the Open NMT implementation in our experiments.' but does not provide any statement or link for the source code of the proposed method itself.
Open Datasets | Yes | The paper uses well-known public datasets: Movielens1m, Movielens10m, Netflix, the WMT 2014 English-to-German dataset, IMDB movie reviews, and the SST-2 sentiment classification task. These are standard academic benchmarks commonly used in the research community.
Dataset Splits | Yes | The paper refers to 'test sets' and a 'dev set' (e.g., 'SST-2 Dev Set SST-2 Test Set'), implying the use of standard, predefined splits for the benchmark datasets. It also states: 'Note that the details about datasets and parameter settings can be found in the appendix.'
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for the experiments.
Software Dependencies | No | The paper mentions 'We use the Open NMT implementation in our experiments.' but does not give a version number for Open NMT or for any other software dependency.
Experiment Setup | Yes | The paper provides specific hyperparameters: 'We use the dropout probability of 0.1, weight decay of 1e-5, and learning rate of 1e-3 for all experiments.' (Section 4.2); 'We use the same dropout rate of 0.1 and label smoothing value of 0.1 for the baseline model and our SSE-enhanced model. The only difference between the two models is whether or not we use our proposed SSE-SE with p0 = 0.01 in (5) for both encoder and decoder embedding layers.' (Section 4.3); and 'We use SSE probability of 0.015 for embeddings (onehot encodings) associated with labels and SSE probability of 0.015 for embeddings (word-piece embeddings) associated with inputs.' (Section 4.4). A sketch of how these settings fit together appears after the table.
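
The mechanism behind Algorithm 1 is simple: during training, the indices fed to an embedding layer are swapped for other indices with some probability before lookup, so that embeddings are stochastically shared. Below is a minimal PyTorch-style sketch of SSE-SE, the special case of SSE-Graph in which the knowledge graph is complete and the replacement index is drawn uniformly at random. The class name and structure are our own illustration under those assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class SSEEmbedding(nn.Module):
    """Embedding layer with Stochastic Shared Embeddings (SSE-SE sketch).

    During training, each input index is replaced with probability p0 by
    an index drawn uniformly at random, stochastically sharing embeddings.
    At eval time it behaves like a plain nn.Embedding. SSE-Graph would
    instead draw the replacement from the index's neighbors in a
    knowledge graph.
    """

    def __init__(self, num_embeddings: int, embedding_dim: int, p0: float = 0.01):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.num_embeddings = num_embeddings
        self.p0 = p0

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        if self.training and self.p0 > 0:
            # Bernoulli mask selecting which positions get swapped.
            swap = torch.rand_like(indices, dtype=torch.float) < self.p0
            random_indices = torch.randint_like(indices, self.num_embeddings)
            indices = torch.where(swap, random_indices, indices)
        return self.embedding(indices)
```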
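
The quoted experiment settings can likewise be read as a single configuration. The sketch below (reusing the SSEEmbedding class above) shows one plausible wiring of the stated dropout, weight decay, learning rate, and SSE probability around a one-hidden-layer recommender network in the spirit of Section 4.2. The network shape, the Adam optimizer, and the Movielens1m-like sizes are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper (Sections 4.2-4.3).
DROPOUT = 0.1          # dropout probability
WEIGHT_DECAY = 1e-5    # L2 weight decay
LEARNING_RATE = 1e-3
SSE_P0 = 0.01          # SSE-SE replacement probability (illustrative here)


class TinyRecNet(nn.Module):
    """Illustrative one-hidden-layer network over user/item embeddings,
    loosely following the recommender-system setting of Section 4.2."""

    def __init__(self, n_users: int, n_items: int, dim: int = 64):
        super().__init__()
        self.user_emb = SSEEmbedding(n_users, dim, p0=SSE_P0)
        self.item_emb = SSEEmbedding(n_items, dim, p0=SSE_P0)
        self.dropout = nn.Dropout(DROPOUT)
        self.hidden = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        x = self.dropout(torch.relu(self.hidden(x)))
        return self.out(x).squeeze(-1)


# Movielens1m-like sizes, used purely for illustration.
model = TinyRecNet(n_users=6040, n_items=3706)
optimizer = torch.optim.Adam(
    model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)
```

Note how SSE-SE composes with, rather than replaces, the standard regularizers: dropout and weight decay act on activations and weights, while SSE-SE perturbs only the embedding lookup.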