Loss Decomposition for Fast Learning in Large Output Spaces
Authors: Ian En-Hsu Yen, Satyen Kale, Felix Yu, Daniel Holtmann-Rice, Sanjiv Kumar, Pradeep Ravikumar
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments on multiclass and multilabel classification with hundreds of thousands of classes, as well as training skip-gram word embeddings with a vocabulary size of half a million, our technique consistently improves the accuracy of search-based gradient approximation methods and outperforms sampling-based gradient approximation methods by a large margin. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, Pittsburgh, USA; ²Google, New York, USA. Correspondence to: Ian E.H. Yen <eyan@cs.cmu.edu>, Satyen Kale <satyenkale@google.com>. |
| Pseudocode | Yes | Algorithm 1 Loss and Gradient Approximation via Search (a hedged illustration appears below the table) |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about releasing the source code for their proposed methodology. |
| Open Datasets | Yes | For multiclass classification, we conduct experiments on the largest publicly available facial recognition dataset MegaFace (Challenge 2), where each identity is considered a class, and each sample is an image cropped by a face detector. The dataset statistics are shown in Table 1. |
| Dataset Splits | No | The paper uses "Test Accuracy" and "Train Accuracy" in its figures, implying the use of splits, but it does not explicitly provide the specific percentages, sample counts, or methodology for splitting the datasets into training, validation, or test sets. |
| Hardware Specification | No | The paper states that methods are 'parallelized with 10 CPU cores in a shared-memory architecture, running on a dedicated machine' and 'parallelized with 24 CPU cores', but it does not provide specific details such as CPU model, GPU model, or memory specifications. |
| Software Dependencies | No | The paper states 'All the implementation are in C++' and refers to several external methods and packages like 'Spherical Clustering' and 'word2vec package', but it does not provide specific version numbers for any software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | For multiclass and multilabel classification, we employ a Stochastic Gradient Descent (SGD) optimization algorithm, with an initial step size chosen from {1, 0.1, 0.01} for the best performance of each method, with a 1/(1 + t) cooling scheme where t is the iteration counter. The minibatch size is 10. (A hedged sketch of this setup appears below the table.) |
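
The pseudocode row above refers to Algorithm 1 (Loss and Gradient Approximation via Search). The paper's algorithm is not reproduced here; the following is only a minimal sketch of the general idea behind search-based gradient approximation, assuming the active classes are selected by a maximum inner product search over the class weight vectors (replaced below by a brute-force top-k for simplicity). The function and variable names (`approx_softmax_grad`, `W`, `k`) are illustrative, not taken from the paper.

```python
import numpy as np

def approx_softmax_grad(x, y, W, k=50):
    """Hedged sketch: approximate the softmax loss/gradient for one example
    by restricting attention to the classes with the largest logits
    (found here by brute force; a search structure would be used in practice).

    x : (d,)   input features for one example
    y : int    true class index
    W : (K, d) class weight matrix, K classes (assumes k < K)
    k : number of top-scoring classes kept in the active set
    """
    logits = W @ x                           # scores for all K classes
    top = np.argpartition(-logits, k)[:k]    # k largest logits (stand-in for MIPS)
    active = np.union1d(top, [y])            # always include the true class

    z = logits[active]
    z = z - z.max()                          # numerical stability
    p = np.exp(z) / np.exp(z).sum()          # softmax restricted to the active set

    loss = -np.log(p[np.searchsorted(active, y)])
    grad_W = np.zeros_like(W)
    grad_W[active] = np.outer(p, x)          # d(loss)/d(W_c) for active classes
    grad_W[y] -= x                           # subtract x for the true class
    return loss, grad_W
```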
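The experiment-setup quote describes SGD with an initial step size tuned over {1, 0.1, 0.01}, a 1/(1 + t) cooling scheme, and a minibatch size of 10. Below is a minimal sketch of that schedule and training loop; `params`, `grad_fn`, and `data` are hypothetical placeholders, and the paper's actual training code (implemented in C++) is not shown here.

```python
def make_lr_schedule(eta0):
    """Step size eta0 / (1 + t); eta0 is tuned over {1, 0.1, 0.01} per the quote above."""
    return lambda t: eta0 / (1.0 + t)

def sgd_train(params, grad_fn, data, eta0=0.1, batch_size=10, n_epochs=1):
    """Plain minibatch SGD with the 1/(1 + t) cooling scheme (hypothetical loop).

    params  : array-like model parameters
    grad_fn : callable(params, batch) -> gradient with the same shape as params
    data    : list of training examples
    """
    lr = make_lr_schedule(eta0)
    t = 0
    for _ in range(n_epochs):
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]                     # minibatch of up to 10 examples
            params = params - lr(t) * grad_fn(params, batch)   # SGD step with cooled step size
            t += 1
    return params
```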