Loss Decomposition for Fast Learning in Large Output Spaces

Authors: Ian En-Hsu Yen, Satyen Kale, Felix Yu, Daniel Holtmann-Rice, Sanjiv Kumar, Pradeep Ravikumar

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments on multiclass and multilabel classification with hundreds of thousands of classes, as well as training skip-gram word embeddings with a vocabulary size of half a million, our technique consistently improves the accuracy of search-based gradient approximation methods and outperforms sampling-based gradient approximation methods by a large margin.
Researcher Affiliation | Collaboration | Carnegie Mellon University, Pittsburgh, USA; Google, New York, USA. Correspondence to: Ian E.H. Yen <eyan@cs.cmu.edu>, Satyen Kale <satyenkale@google.com>.
Pseudocode | Yes | Algorithm 1: Loss and Gradient Approximation via Search (a hedged sketch of the search-based gradient idea appears after this table).
Open Source Code | No | The paper does not provide a specific link or an explicit statement about releasing the source code for its proposed method.
Open Datasets | Yes | For multiclass classification, we conduct experiments on the largest publicly available facial recognition dataset MegaFace (Challenge 2), where each identity is considered a class and each sample is an image cropped by a face detector. The dataset statistics are shown in Table 1.
Dataset Splits | No | The paper uses "Test Accuracy" and "Train Accuracy" in its figures, implying the use of splits, but it does not explicitly provide the specific percentages, sample counts, or methodology for splitting the datasets into training, validation, or test sets.
Hardware Specification | No | The paper states that methods are 'parallelized with 10 CPU cores in a shared-memory architecture, running on a dedicated machine' and 'parallelized with 24 CPU cores', but it does not provide specific details such as CPU model, GPU model, or memory specifications.
Software Dependencies | No | The paper states 'All the implementation are in C++' and refers to several external methods and packages like 'Spherical Clustering' and the 'word2vec package', but it does not provide specific version numbers for any software dependencies required to replicate the experiments.
Experiment Setup | Yes | For multiclass and multilabel classification, we employ a Stochastic Gradient Descent (SGD) optimization algorithm, with an initial step size chosen from {1, 0.1, 0.01} for the best performance of each method, with a 1/(1 + t) cooling scheme where t is the iteration counter. The minibatch size is 10. (A sketch of this step-size schedule appears after this table.)
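
The Pseudocode row above refers to Algorithm 1 (Loss and Gradient Approximation via Search). The paper's algorithm is not reproduced on this page; the snippet below is only a minimal sketch of the general search-based gradient-approximation idea it relies on: rather than computing logits for every class, the gradient is formed from a small candidate set returned by an inner-product search (here a brute-force placeholder) plus the true class. The names `top_k_candidates` and `approx_softmax_grad` are illustrative and do not come from the paper.

```python
import numpy as np

def top_k_candidates(W, h, k):
    # Placeholder maximum-inner-product search: brute force over all classes.
    # A real implementation would use a search structure (clustering, trees, etc.).
    scores = W @ h
    return np.argpartition(-scores, k)[:k]

def approx_softmax_grad(W, h, y, k=50):
    # Approximate the softmax-loss gradient w.r.t. the class weight matrix W
    # using only the retrieved candidate classes plus the true class y.
    cand = np.union1d(top_k_candidates(W, h, k), [y]).astype(int)
    logits = W[cand] @ h
    logits = logits - logits.max()                 # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()      # softmax over the candidate set only
    grad_W = np.zeros_like(W)
    grad_W[cand] = np.outer(p, h)                  # dL/dW_j ~ p_j * h for candidates
    grad_W[y] -= h                                 # extra -h term for the true class
    return grad_W

# Illustrative usage with random data (shapes are arbitrary):
# W = np.random.randn(100000, 128); h = np.random.randn(128)
# g = approx_softmax_grad(W, h, y=42, k=50)
```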
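
The Experiment Setup row quotes an SGD configuration with an initial step size tuned over {1, 0.1, 0.01}, a 1/(1 + t) cooling scheme, and minibatches of size 10. The sketch below implements only that step-size schedule; the update rule and `grad_fn` are generic placeholders, not the paper's implementation.

```python
import numpy as np

def sgd_with_cooling(params, grad_fn, batches, eta0=0.1):
    # SGD with the 1/(1 + t) cooling scheme quoted above:
    # at iteration t the step size is eta0 / (1 + t), with eta0 tuned over {1, 0.1, 0.01}.
    for t, batch in enumerate(batches):
        eta_t = eta0 / (1.0 + t)
        params = params - eta_t * grad_fn(params, batch)
    return params

# Illustrative usage: least-squares regression on random minibatches of size 10.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=5)
    batches = [(X := rng.normal(size=(10, 5)), X @ w_true) for _ in range(100)]
    grad_fn = lambda w, b: 2 * b[0].T @ (b[0] @ w - b[1]) / len(b[1])
    w = sgd_with_cooling(np.zeros(5), grad_fn, batches, eta0=0.1)
```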