Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Gaussian-Smoothed Sliced Probability Divergences

Authors: Mokhtar Z. Alaya, Alain Rakotomamonjy, Maxime Berar, Gilles Gasso

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We support our theoretical findings with empirical studies in the context of privacy-preserving domain adaptation. 4 Numerical Experiments In this section, we report on a series of experiments that support the established theoretical results. We also highlight the usefulness of the findings in the context of a privacy-preserving domain adaptation problem. 4.1 Supporting the theoretical results Sample complexity. The first experiment (see Figure 1) analyzes the sample complexity of different base divergences. It shows that the sample complexity stays similar to that of their original and sliced counterparts up to a constant (see Proposition 3.8). For this purpose, we have considered samples in R^d randomly drawn from a normal distribution N(0, I). For the Sinkhorn divergence, the entropy regularization has been set to 0.1 and for MMD, we used a Gaussian kernel whose bandwidth has been set to the mean of all pairwise distances between samples. The number of projections has been fixed to L = 50 and we perform 20 runs per experiment. For the first study, the convergence rate has been evaluated by increasing the number of samples up to 25,000 with fixed dimension d = 50. For the second one, we vary both the dimension and the number of samples. Figure 1 shows the sample complexity of several sliced divergences, denoted SWD, SKD and MMD for the sliced Wasserstein distance, Sinkhorn divergence and maximum mean discrepancy, and of their Gaussian-smoothed sliced versions, denoted GS SWD, GS SKD and GS MMD. On the top plot, we can see that all Gaussian-smoothed sliced divergences preserve the complexity rate with only a slight to moderate overhead. The largest gap is for the Sinkhorn divergence, while for MMD the smoothing comes almost for free in terms of complexity.
From the bottom plot, where sample complexities for different dimensions d are given, we confirm the finding that Gaussian smoothing preserves the dimension-independence of the convergence rate of sliced divergences. Two other experiments on the sample complexity and the identity of indiscernibles are also reported in the supplementary material. Projection complexity. We have also investigated the impact of the number of projections when estimating the distance between two sets of 500 samples drawn from the same distribution, N(0, I). Figure 2 plots the approximation error between the true expectation of the sliced divergences (computed with L = 10,000 projections) and its approximated versions. We remark that, for all methods, the error varies within a 10-fold range when approximating with 50 projections and decreases with the number of projections. Performance path on the impact of the noise parameter. Since the Gaussian smoothing parameter σ is key in a privacy-preserving context, as it determines the level of privacy of the Gaussian mechanism, we have analyzed its impact on the smoothed sliced divergence. We have reproduced the sample complexity experiment with different values of σ. The number of projections has been set to 50. Figure 3 shows these sample complexities. The first very interesting point to note is that the smoothing parameter has almost no effect on the GS MMD sample complexity. For the GS SWD and GS SKD divergences, instead, the smoothing tends to increase the divergence at a fixed number of samples. Another interpretation is that to achieve a given value of divergence, one needs far more samples when the smoothing is larger (e.g., to reach the divergence value obtained at σ = 5, one needs almost 10-fold more samples when σ = 15). This overhead in samples needed as smoothing increases is properly described, for the Gaussian-smoothed sliced Wasserstein distance, in our Proposition 3.8, as the sample complexity depends on the moments of the Gaussian.
As a conclusion from these analyses, we highlight that the Gaussian-smoothed sliced MMD seems to present several strong benefits: its sample complexity does not depend on the dimension and seems to be the best among the divergences we considered. More interestingly, it is not impacted by the amount of Gaussian smoothing and thus not impacted by the desired privacy level. 4.2 Domain adaptation with GσSW As an application, we have considered the problem of unsupervised domain adaptation for a classification task. In this context, given source examples Xs with labels ys and unlabeled target examples Xt, our goal is to design a classifier h(·) learned from the source examples that generalizes well to the target ones. A classical approach consists in learning a representation mapping g(·) that leads to invariant latent representations, invariance being measured as a distance between the empirical distributions of mapped source and target samples. Formally, this leads to the following problem: min_{g,h} L_c(h(g(Xs)), ys) + D(g(Xs), g(Xt)), where L_c can be the cross-entropy loss or a quadratic loss and D a divergence between empirical distributions; in our case, D will be any Gaussian-smoothed sliced divergence. We solve this problem through stochastic gradient descent, similarly to many approaches that use the sliced Wasserstein distance as a distribution distance Lee et al. (2019). Note that, in practice, using a smoothed divergence preserves the privacy of the target samples, as shown by (Rakotomamonjy & Ralaivola, 2021). When performing such model adaptation, a privacy/utility trade-off has to be handled. In practice, one would prefer the most private model while not hurting its performance. Hence, one would seek the largest noise level σ > 0 to use while preserving accuracy on the target domain, so it is useful to evaluate how the model performs over a range of noise levels (hence, privacy levels).
This can be computationally expensive, as it requires fully training several models over hundreds of epochs. Instead, we leverage the continuity of our GσSD to employ a fine-tuning strategy: we train a domain adaptation model for the largest desired value of σ (over the full number of epochs) and, when σ is decreased, we simply fine-tune the latest model by training for only one epoch. Our experiments evaluate the studied Gaussian-smoothed sliced divergences in classical unsupervised domain adaptation. We have considered two datasets: a handwritten digit recognition task (USPS/MNIST) and the Office 31 dataset. In our first analysis, we have compared the performance of our GσSD with non-smoothed divergences. The first is the sliced Wasserstein distance (SWD) Lee et al. (2019) and the second is the Jensen-Shannon approximation based on an adversarial approach, known as DANN Ganin & Lempitsky (2015). For all methods and for each dataset, we used the same neural network architecture for representation mapping and for classification. Approaches differ only in how the distance between distributions is computed. Here, for each noise value σ, we have trained the model from scratch for 100 epochs. Results are depicted in Figure 4. For the two problems, we can see that the performance obtained with the Gaussian-smoothed sliced Wasserstein or MMD divergences is similar to that obtained with DANN or SWD across the full range of noise. The smoothed version of Sinkhorn is less stable and induces a slight loss of performance. Owing to the metric property and the induced weak topology, privacy preservation comes almost without loss of performance in this domain adaptation context. In the second analysis, we have studied the privacy/utility trade-off when fine-tuning models, using only one epoch, for decreasing values of σ. Results are shown in Figure 5.
They highlight that, depending on the data and the smoothed divergence used, the performance drop varies between one percent for Office 31 and four percent for USPS to MNIST. Note that, except for the largest value of σ, each model is trained using only one epoch instead of a hundred. A very large gain in complexity is thus achieved when sweeping the full range of noise levels. Hence, depending on the importance of this slight drop in performance, it may be worth either using a large value of σ to preserve strong privacy, or going through a validation procedure over several (cheaply obtained) models.
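The core quantity evaluated above, a Gaussian-smoothed sliced divergence, can be illustrated with a short numpy sketch: project both samples onto random unit directions, add independent N(0, σ²) noise to each 1-D projection (the Gaussian smoothing), and average the closed-form 1-D Wasserstein distance over L projections. This is a minimal illustration of the general recipe, not the authors' code; the function name and defaults are hypothetical, and equal sample sizes are assumed so the sorted-sample formula applies.

```python
import numpy as np

def gaussian_smoothed_sliced_wasserstein(X, Y, sigma=1.0, n_proj=50, p=2, rng=None):
    """Monte Carlo estimate of a Gaussian-smoothed sliced p-Wasserstein
    distance between two equal-size samples X, Y of shape (n, d).

    Each 1-D projection is perturbed with N(0, sigma^2) noise before the
    closed-form 1-D Wasserstein distance between sorted samples is computed.
    Illustrative sketch only; details (normalization, noise placement) may
    differ from the paper's exact estimator."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Random projection directions, uniform on the unit sphere.
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    dists = np.empty(n_proj)
    for i, t in enumerate(theta):
        # Project onto direction t, then smooth with Gaussian noise.
        x = X @ t + sigma * rng.normal(size=X.shape[0])
        y = Y @ t + sigma * rng.normal(size=Y.shape[0])
        # 1-D Wasserstein-p via sorted (quantile) matching.
        dists[i] = np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)
    return np.mean(dists) ** (1.0 / p)
```

With σ fixed, re-running this estimator for increasing n reproduces the kind of sample-complexity curves discussed above: the estimate between two samples from the same distribution shrinks as n grows, with an overhead that depends on σ.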
Researcher Affiliation Collaboration Mokhtar Z. Alaya (EMAIL): Université de Technologie de Compiègne, LMAC (Laboratoire de Mathématiques Appliquées de Compiègne), CS 60 319, 60 203 Compiègne Cedex; Alain Rakotomamonjy (EMAIL): Criteo AI Lab, Paris, France; Maxime Berar (EMAIL): Univ Rouen Normandie, INSA Rouen Normandie, Université Le Havre Normandie, Normandie Univ, LITIS UR4108, Rouen, France; Gilles Gasso (EMAIL): INSA Rouen Normandie, Univ Rouen Normandie, Université Le Havre Normandie, Normandie Univ, LITIS UR4108, Rouen, France
Pseudocode No The paper includes theoretical properties, proofs, and propositions (e.g., Section 3, Appendix A), but it does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described narratively or mathematically.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository.
Open Datasets Yes Our experiments evaluate the studied Gaussian-smoothed sliced divergences in classical unsupervised domain adaptation. We have considered two datasets: a handwritten digit recognition task (USPS/MNIST) and the Office 31 dataset. Figure 6: Measuring the divergence between two sets of samples drawn iid from the CIFAR10 dataset.
Dataset Splits No The paper mentions using USPS/MNIST, Office 31, and CIFAR10 datasets. It discusses training models and evaluating performance in domain adaptation, but it does not specify any particular training/validation/test splits (e.g., percentages, counts, or references to standard splits with details) for these datasets. It only mentions using the same neural network architecture 'for each dataset' and training 'for 100 epochs' or 'one epoch' during fine-tuning.
Hardware Specification No The paper describes numerical experiments and evaluations but does not provide any specific details about the hardware used, such as GPU models, CPU specifications, or memory.
Software Dependencies No The paper describes experimental setups and parameters for divergences (e.g., 'entropy regularization has been set to 0.1' for Sinkhorn, 'Gaussian kernel' for MMD), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes For the Sinkhorn divergence, the entropy regularization has been set to 0.1 and for MMD, we used a Gaussian kernel whose bandwidth has been set to the mean of all pairwise distances between samples. The number of projections has been fixed to L = 50 and we perform 20 runs per experiment. For the first study, the convergence rate has been evaluated by increasing the number of samples up to 25,000 with fixed dimension d = 50. For the second one, we vary both the dimension and the number of samples. For all methods and for each dataset, we used the same neural network architecture for representation mapping and for classification. Approaches differ only in how the distance between distributions is computed. Here, for each noise value σ, we have trained the model from scratch for 100 epochs. Instead, we leverage the continuity of our GσSD to employ a fine-tuning strategy: we train a domain adaptation model for the largest desired value of σ (over the full number of epochs) and, when σ is decreased, we just fine-tune the latest model by training for only one epoch.
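The MMD setup quoted above (Gaussian kernel, bandwidth set to the mean of all pairwise distances) can be sketched in a few lines of numpy. This is an illustrative sketch only: the function name is hypothetical, a biased V-statistic estimator of MMD² is used for brevity, and the exact bandwidth convention (e.g., the factor of 2 in the exponent) is an assumption rather than something the paper specifies.

```python
import numpy as np

def gaussian_mmd2(X, Y):
    """Biased (V-statistic) estimate of MMD^2 with a Gaussian kernel whose
    bandwidth is the mean of all pairwise distances of the pooled sample,
    mirroring the bandwidth heuristic described in the experiment setup.
    Sketch under assumed conventions; not the authors' implementation."""
    Z = np.vstack([X, Y])
    # Pairwise squared Euclidean distances of the pooled sample.
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    # Bandwidth heuristic: mean of all off-diagonal pairwise distances.
    off_diag = ~np.eye(len(Z), dtype=bool)
    h = np.sqrt(d2)[off_diag].mean()
    # Gaussian kernel matrix (factor 2*h^2 in the exponent is an assumption).
    K = np.exp(-d2 / (2.0 * h**2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    # MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]; nonnegative by construction.
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```

A Gaussian-smoothed sliced version would apply this 1-D estimator to noisy random projections of X and Y, exactly as in the sliced Wasserstein case, then average over the L = 50 projections.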