Universal Sample Coding

Authors: Szymon Kobus, Tze-Yang Tung, Deniz Gunduz

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To corroborate the claims about universal sample coding made in Theorem 5.1, we compare the KL-divergence and total communication cost between the theory and numerical evaluations for different numbers of samples and dimensions. Each experiment is repeated 1000 times, where a probability distribution P is sampled from a k-dimensional Dirichlet distribution with concentration parameters α ∈ R^k, α_i = 1. Then, n samples are communicated using universal sample coding (Algorithm 1). We consider ordered random coding (Theis and Yosri, 2022), shown in Appendix C, as the underlying sample communication method (line 7, Algorithm 1). In Figure 3, we observe that the measured KL-divergence D_KL(P ‖ Q̂) between the true probability P and the estimate Q̂ from the estimator in Lemma 5.2 follows the predicted values across a range of communicated samples. In Figure 4, we fix the number of samples to n = 128 but vary the dimension k, and show that the KL-divergence lies below the bound from Lemma 5.2. In Figure 5, we plot the total communication cost for various numbers of samples, which is contained between the asymptotic upper and lower bounds, V_k(c) log(n) (Theorem 5.1) and ((k-1)/2) log(n) (Theorem 5.3), respectively. To show the efficacy of universal sample coding in practice, we first apply it to the Federated Probabilistic Mask Training (FedPM) algorithm proposed in Isik et al. (2024).
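As an editorial illustration of the evaluation loop quoted above, the following minimal Python sketch samples P from a symmetric Dirichlet, draws n samples, and measures D_KL(P ‖ Q̂). An add-one (Laplace) smoothed empirical estimate stands in for the estimator of Lemma 5.2, whose exact form is not reproduced here, and the function names (kl_divergence, run_trial) are illustrative only.

```python
import numpy as np


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions on the same support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def run_trial(k, n, rng):
    # Sample a k-dimensional distribution P from a symmetric Dirichlet (alpha_i = 1).
    p = rng.dirichlet(np.ones(k))
    # Draw the n samples that would be communicated by Algorithm 1.
    counts = rng.multinomial(n, p)
    # Stand-in estimator: add-one (Laplace) smoothing of the empirical counts.
    # The estimator of Lemma 5.2 may differ; this is only an illustrative proxy.
    q_hat = (counts + 1.0) / (n + k)
    return kl_divergence(p, q_hat)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, n, trials = 8, 128, 1000
    mean_kl = np.mean([run_trial(k, n, rng) for _ in range(trials)])
    print(f"mean D_KL(P || Q_hat) over {trials} trials: {mean_kl:.4f}")
```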
Researcher Affiliation | Collaboration | Szymon Kobus, Department of Electrical and Electronic Engineering, Imperial College London, szymon.kobus17@imperial.ac.uk; Tze-Yang Tung, Not Diamond, tze-yang@notdiamond.ai; Deniz Gündüz, Department of Electrical and Electronic Engineering, Imperial College London, d.gunduz@imperial.ac.uk
Pseudocode | Yes | Algorithm 1: Universal Sample Coding
Input: P, n ≥ 1. Parameter: c > 0.
1: Send sample X_1 ~ P.
2: Let i = 2, G(1) = 1.
3: while i ≤ log_{1+c}(n) + 1 do
4:   Let G(i) = (1 + c)^{i-1}.
5:   Let Q_i = Q̂_i(X^{G(i-1)}) be the probability estimated from all previous samples (estimator of Lemma 5.2).
6:   Let g(i) = G(i) - G(i-1).
7:   Jointly send g(i) samples X_{G(i-1)+1}^{G(i)} ~ P^{g(i)} using Q_i^{g(i)} (Eqn. (10)).
8:   i = i + 1
9: end while
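A minimal Python sketch of Algorithm 1's control flow is given below; it is not the authors' released implementation. The joint sample-communication step of line 7 (ordered random coding) is replaced by local sampling, add-one smoothing stands in for the Lemma 5.2 estimator, and the rounding of G(i) to an integer is an assumption of this sketch.

```python
import math

import numpy as np


def universal_sample_coding_sketch(p, n, c=0.5, rng=None):
    """Sketch of the batching schedule in Algorithm 1 (illustrative only).

    The joint sample-communication step (line 7, ordered random coding) is
    replaced by drawing the samples locally, so only the control flow and the
    growth of the batch sizes g(i) can be inspected.
    """
    rng = rng or np.random.default_rng()
    k = len(p)
    sent = [int(rng.choice(k, p=p))]        # line 1: send X_1 ~ P
    i, G_prev = 2, 1                        # line 2: i = 2, G(1) = 1
    while i <= math.log(n, 1 + c) + 1:      # line 3
        G_i = round((1 + c) ** (i - 1))     # line 4 (rounding is an assumption)
        counts = np.bincount(sent, minlength=k)
        q_i = (counts + 1.0) / (counts.sum() + k)  # line 5: stand-in for Lemma 5.2
        g_i = G_i - G_prev                  # line 6
        # line 7: in the paper these g(i) samples are communicated jointly
        # using Q_i^{g(i)} as the coding distribution; here they are just drawn.
        sent.extend(int(x) for x in rng.choice(k, size=g_i, p=p))
        G_prev, i = G_i, i + 1              # line 8
    return sent


# Example usage with a toy 4-symbol distribution.
if __name__ == "__main__":
    samples = universal_sample_coding_sketch(p=np.array([0.4, 0.3, 0.2, 0.1]), n=128)
    print(len(samples), "samples sent in total")
```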
Open Source Code | No | Code implementing Universal Sample Coding, as well as the large language model experiments, is released. Code for federated learning is based on a yet-unreleased codebase shared by the authors of Isik et al. (2024). We are in ongoing discussions with the authors on the best way to release this code, but it will not be publicly available at the time of publication.
Open Datasets | Yes | For the experiments, we follow the same setup as Isik et al. (2024) using their CONV-6 architecture on classification of CIFAR-10 images, where the data is distributed among 10 clients.
Dataset Splits | No | The paper discusses training and testing, but it does not explicitly specify the use of a validation set or provide details on how data was split for validation purposes (e.g., percentages, sample counts).
Hardware Specification | Yes | The FL training was conducted on two Nvidia RTX 3090 GPUs, each with 25 GB of memory, with each experiment taking approximately 12 hours. The experiments were performed on 4 Nvidia RTX A6000 GPUs with 48 GB of memory and PyTorch version 2.1, totalling 10 hours of wall-clock time.
Software Dependencies | No | The paper mentions "PyTorch version 2.1" but does not provide specific version numbers for other software dependencies or libraries used in the experiments to ensure full reproducibility.
Experiment Setup | Yes | For the experiments, we follow the same setup as Isik et al. (2024) using their CONV-6 architecture on classification of CIFAR-10 images, where the data is distributed among 10 clients. A challenging scenario considered in that work is that of congested clients, where only a fraction of clients contribute in each learning round. The results of our experiments are summarized in Table 2. The FL training was conducted on two Nvidia RTX 3090 GPUs, each with 25 GB of memory, with each experiment taking approximately 12 hours. The training, including preliminary experiments, took 30 days of GPU-time. ... To incorporate the prior θ into the proposed coding scheme, we use a Bayesian estimator. Although the mask is binary, we describe a general case with k possible outcomes. Let ω̄ ∈ P_k be the prior probability, and let ω be a random variable distributed according to the Dirichlet distribution with parameters α ∈ R^k, where for j ∈ {0, ..., k-1}: α_j = μ ω̄_j + Σ_{l=1}^{G} 1{l-th sample = j}, where 1{·} is the indicator function, G is the number of samples communicated so far, and μ ∈ R_+ is a hyperparameter controlling the reliance on the prior.
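The Dirichlet-prior update quoted above can be sketched as follows. Taking the Dirichlet mean as the estimate and the helper name prior_weighted_estimate are assumptions made for illustration; the paper's estimator may use the Dirichlet parameters differently.

```python
import numpy as np


def prior_weighted_estimate(prior, counts, mu=1.0):
    """Mean of a Dirichlet with parameters alpha_j = mu * prior_j + counts_j.

    `prior` is the prior probability vector, `counts[j]` counts the previously
    communicated samples equal to j, and `mu` controls the reliance on the
    prior. Using the Dirichlet mean as the estimate is an assumption of this
    sketch, not necessarily the paper's exact estimator.
    """
    alpha = mu * np.asarray(prior, dtype=float) + np.asarray(counts, dtype=float)
    return alpha / alpha.sum()


# Example: a binary mask with prior probability 0.3 of a 1, after observing
# 5 ones and 3 zeros among the samples communicated so far.
print(prior_weighted_estimate(prior=[0.7, 0.3], counts=[3, 5], mu=4.0))
```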