Winner-takes-all learners are geometry-aware conditional density estimators

Authors: Victor Letzelter, David Perera, Cédric Rommel, Mathieu Fontaine, Slim Essid, Gaël Richard, Patrick Pérez

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically substantiate our estimator through experiments on both synthetic and real-world data, including audio data. |
| Researcher Affiliation | Collaboration | ¹Valeo.ai, Paris, France; ²LTCI, Télécom Paris, Institut Polytechnique de Paris, France; ³Meta AI, Paris, France; ⁴Kyutai, Paris, France. |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code at https://github.com/Victorletzelter/VoronoiWTA. |
| Open Datasets | Yes | UCI Regression datasets (Dua & Graff, 2017) are a standard benchmark (Hernández-Lobato & Adams, 2015) to evaluate conditional density estimators. |
| Dataset Splits | Yes | The models were trained until convergence of the training loss, using early stopping on the validation loss. Each of the synthetic datasets consists of 100,000 training points and 25,000 validation points. Post-training, the scaling factor h was tuned based on the average NLL over the validation set (20% of the training data) using a golden-section search (Kiefer, 1953). (A sketch of this tuning step is given below the table.) |
| Hardware Specification | Yes | The training of our neural networks was conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software such as the Python programming language, the PyTorch deep learning framework (Paszke et al., 2019), the AdamW optimizer (Loshchilov & Hutter, 2018), and the Hydra and MLFlow libraries. However, it does not provide specific version numbers for these components (e.g., PyTorch 1.9, Python 3.8), which are required for reproducibility. |
| Experiment Setup | Yes | In each training setup with synthetic data, we used a three-layer MLP with 256 hidden units. The Adam optimizer (Kingma & Ba, 2014) was used with a constant learning rate of 0.001 in each setup. The models were trained until convergence of the training loss, using early stopping to select the checkpoint for which the validation loss was the lowest. Each of the models was trained for 100 epochs, with a batch size of 1024. We utilized SELDnet (Adavanne et al., 2018a) as backbone (1.6M parameters). The AdamW optimizer (Loshchilov & Hutter, 2018) was used, with a batch size of 32, an initial learning rate of 0.05, and following the scheduling scheme from Vaswani et al. (2017). The WTA model was trained using the multi-target version of the Winner-takes-all loss (Equations 2 and 5 of Letzelter et al. (2023)), using confidence weight β = 1. The underlying loss ℓ used was the spherical distance ℓ(ŷ, y) = arccos(ŷᵀy). (A sketch of this loss is given below the table.) |
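
The bandwidth tuning mentioned in the Dataset Splits row is a one-dimensional search: after training, the scaling factor h is chosen to minimize the average negative log-likelihood (NLL) on the validation split via golden-section search. The snippet below is a minimal, self-contained sketch of that search, not the authors' code; `mean_validation_nll` is a hypothetical stand-in for the model-specific NLL computation, and the search interval [1e-3, 10.0] is an illustrative assumption rather than a value from the paper.

```python
import math

def golden_section_search(f, a, b, tol=1e-4):
    """Golden-section search for the minimizer of a unimodal f on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi ~= 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while abs(b - a) > tol:
        if fc < fd:                # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                      # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Hypothetical usage: mean_validation_nll(h) would return the average NLL of
# the trained estimator on the 20% validation split for scaling factor h.
# h_star = golden_section_search(mean_validation_nll, 1e-3, 10.0)
```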
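
For the audio experiments in the Experiment Setup row, the WTA model is trained with the multi-target Winner-takes-all loss of Letzelter et al. (2023), confidence weight β = 1, and the spherical distance ℓ(ŷ, y) = arccos(ŷᵀy). The sketch below shows how such an objective can be written in PyTorch; it is not the authors' implementation, and the tensor shapes (K hypotheses, T unit-norm targets per example, one confidence logit per hypothesis) as well as the binary cross-entropy on the confidence head are assumptions based on the cited formulation.

```python
import torch
import torch.nn.functional as F

def spherical_distance(y_hat, y):
    """arccos of the dot product between unit-norm direction vectors."""
    cos = (y_hat * y).sum(dim=-1).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.acos(cos)

def multi_target_wta_loss(hypotheses, scores, targets, beta=1.0):
    """
    hypotheses: (B, K, 3) unit-norm direction predictions (K hypotheses).
    scores:     (B, K)    confidence logits, one per hypothesis.
    targets:    (B, T, 3) unit-norm ground-truth directions (all T assumed active).
    """
    # Pairwise spherical distances between every hypothesis and every target.
    dists = spherical_distance(hypotheses.unsqueeze(2), targets.unsqueeze(1))  # (B, K, T)

    # Winner-takes-all assignment: each target is matched to its closest
    # hypothesis, and only that "winner" receives the regression gradient.
    winners = dists.argmin(dim=1)                                   # (B, T)
    wta_term = dists.gather(1, winners.unsqueeze(1)).mean()

    # Confidence targets: 1 for hypotheses that won at least one target, else 0.
    conf_targets = torch.zeros_like(scores)
    conf_targets.scatter_(1, winners, 1.0)
    conf_term = F.binary_cross_entropy_with_logits(scores, conf_targets)

    return wta_term + beta * conf_term
```

With β = 1, as reported above, the confidence term and the winner regression term contribute with equal weight.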