Federated Learning with Only Positive Labels

Authors: Felix Yu, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar

ICML 2020

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically evaluate the proposed FedAwS method on benchmark image classification and extreme multi-class classification datasets. In all experiments, both the class embeddings wc and instance embeddings gθ(x) are ℓ2-normalized, as we found this slightly improves model quality. We compare the following methods in our experiments. Baseline-1: Training with only positive squared hinge loss. As expected, we observe very low precision values because the model quickly collapses to a trivial solution. Baseline-2: Training with only positive squared hinge loss with the class embeddings fixed. This is a simple way of preventing the class embeddings from collapsing into a single point. FedAwS: Our method with stochastic negative mining (cf. Section 4.2). Softmax: An oracle method of regular training with the softmax cross-entropy loss function that has access to both positive and negative labels. ... 6.1. Experiments on CIFAR: We first present results on the CIFAR-10 and CIFAR-100 datasets. We trained ResNets (He et al., 2016a;b) with different numbers of layers as the underlying model. Specifically, we train ResNet-8 and ResNet-32 for CIFAR-10, and train ResNet-32 and ResNet-56 for CIFAR-100 with the larger number of classes. From Table 1, we see that on both CIFAR-10 and CIFAR-100, FedAwS almost matches or comes very close to the performance of the oracle method which has access to all labels.
Researcher Affiliation Industry Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar (Google Research, New York). Correspondence to: Felix X. Yu <felixyu@google.com>, Ankit Singh Rawat <ankitsrawat@google.com>.
Pseudocode Yes Algorithm 1 Federated averaging with spreadout (FedAwS)
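Algorithm 1's server-side step applies a spreadout regularizer, with stochastic negative mining, to push class embeddings apart after averaging the client updates. The NumPy sketch below illustrates one such update under stated assumptions: the distance is Euclidean between unit-norm class embeddings, the per-class loss is sum over the k mined classes of max(0, ν − ‖wc − wc'‖)², and the margin `nu` and step size `lr` are illustrative values, not taken from the paper.

```python
import numpy as np

def spreadout_step(W, k=10, nu=0.9, lr=0.05):
    """One server-side spreadout update on the class-embedding matrix.

    W  : (C, d) matrix of L2-normalized class embeddings.
    k  : number of most-confusable ("top confusing") classes mined per class.
    nu : separation margin (assumed value).

    For each class c, only its k nearest classes are penalized
    (stochastic negative mining); the gradient of
    max(0, nu - ||w_c - w_c'||)^2 w.r.t. w_c is derived by hand.
    """
    C, d = W.shape
    grad = np.zeros_like(W)
    diffs = W[:, None, :] - W[None, :, :]      # (C, C, d) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)     # (C, C) pairwise distances
    np.fill_diagonal(dists, np.inf)            # exclude self-pairs
    for c in range(C):
        nearest = np.argsort(dists[c])[:k]     # mine the k closest classes
        for c2 in nearest:
            viol = nu - dists[c, c2]
            if viol > 0:                       # pair is inside the margin
                grad[c] += -2.0 * viol * diffs[c, c2] / dists[c, c2]
    W = W - lr * grad                          # gradient-descent step
    # Re-normalize, matching the paper's L2-normalized class embeddings.
    return W / np.linalg.norm(W, axis=1, keepdims=True)
```

A single step moves near-collapsed class embeddings apart while leaving well-separated ones (outside the margin) untouched.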
Open Source Code No The paper does not contain any explicit statements about the release of source code or links to a code repository.
Open Datasets Yes We empirically evaluate the proposed FedAwS method on benchmark image classification and extreme multi-class classification datasets. We first present results on the CIFAR-10 and CIFAR-100 datasets. ... We test the proposed approach on standard extreme multilabel classification datasets (Varma, 2018). These datasets have a large number of classes, and therefore are good representatives of the applications of federated learning with only positive labels. Similar to Reddi et al. (2019), because these datasets are multi-label, we uniformly sample positive labels to obtain datasets corresponding to multi-class classification problems. The datasets and their statistics are summarized in Table 2.
Dataset Splits No The paper mentions 'Train Points' and 'Test Points' in Table 2 for the datasets used, but it does not explicitly specify separate validation dataset splits with percentages or sample counts for reproducibility.
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies No The paper mentions optimizers like 'SGD' and 'Adagrad' and model architectures like 'Res Nets', but it does not specify any software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup Yes For FedAwS, we use the squared hinge loss with cosine distance to define R̂pos(Si) at the clients (cf. Algorithm 1): ℓpos(f(x), y) = (max{0, 0.9 − gθ(x)⊤wy})². ... We use a simple embedding-based classification model wherein an instance x ∈ ℝd, a high-dimensional sparse vector, is first embedded into ℝ512 using a linear embedding lookup followed by averaging. The vector is then passed through a three-layer neural network with layer sizes 1024, 1024 and 512, respectively. The first two layers in the network apply a ReLU activation function. The output of the network is then normalized to obtain instance embeddings with unit ℓ2-norm. Each class is represented as a 512-dimensional normalized vector. ... SGD with a large learning rate is used to optimize the embedding layers, and Adagrad is used to update other model parameters. In each round, we randomly select 4K clients associated with 4K labels. ... There are two meta-parameters in the proposed method: the learning rate multiplier of the spreadout loss λ (cf. Algorithm 1), and the number of top confusing labels considered in each round, k (cf. (8)). To make a fair comparison with other methods which do not have these meta-parameters, in all of our other experiments in Table 3, we simply use k = 10 and λ = 10.
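The client-side loss quoted above is a squared hinge on cosine similarity: since both embeddings are ℓ2-normalized, the inner product gθ(x)⊤wy is the cosine, and the loss is (max{0, 0.9 − gθ(x)⊤wy})². A minimal NumPy sketch, with function and variable names of my own choosing:

```python
import numpy as np

def positive_squared_hinge(g_x, w_y, margin=0.9):
    """Positive squared hinge loss with cosine distance.

    g_x : L2-normalized instance embedding g_theta(x), shape (d,)
    w_y : L2-normalized embedding of the positive class y, shape (d,)

    With unit-norm inputs the dot product equals the cosine similarity,
    so the loss vanishes once similarity exceeds the 0.9 margin.
    """
    cos_sim = float(np.dot(g_x, w_y))
    return max(0.0, margin - cos_sim) ** 2
```

For example, an instance embedding identical to its class embedding incurs zero loss, while an orthogonal pair incurs (0.9)² = 0.81.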