Sparsity-Constrained Optimal Transport
Authors: Tianlin Liu, Joan Puigcerver, Mathieu Blondel
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our framework in Section 6 and in Appendix A through a variety of experiments. ... We applied sparsity-constrained OT to vision sparse mixtures of experts (V-MoE) models for large-scale image recognition (Riquelme et al., 2021). ... Table 2 summarizes the validation accuracy on JFT-300M and 10-shot accuracy on ImageNet. |
| Researcher Affiliation | Collaboration | Tianlin Liu (University of Basel); Joan Puigcerver (Google Research, Brain team); Mathieu Blondel (Google Research, Brain team) |
| Pseudocode | No | The paper references Algorithm 1 from Riquelme et al. (2021) in Appendix A.5, but it does not contain pseudocode or an algorithm block for its own proposed method. |
| Open Source Code | No | The paper does not include any statement about releasing source code or provide a link to a code repository for its methodology. |
| Open Datasets | Yes | We train on the JFT-300M dataset (Sun et al., 2017), which is a large scale dataset that contains more than 305 million images. We then perform 10-shot transfer learning on the ImageNet dataset (Deng et al., 2009). |
| Dataset Splits | Yes | JFT-300M has around 305M training and 50,000 validation images. ... For downstream evaluations, we perform 10-shot linear transfer on ImageNet (Deng et al., 2009). ... This newly initialized layer is trained on 10 examples per ImageNet class (10-shot learning). |
| Hardware Specification | Yes | Table 4: Total Training TPUv2-core-days |
| Software Dependencies | No | The paper mentions algorithms such as LBFGS and ADAM, but it does not name specific software libraries or their version numbers; settings are reported only as hyperparameters (e.g., "ADAM optimizer with a learning rate of 10^-2"). |
| Experiment Setup | Yes | We do so by using an ADAM optimizer with a learning rate of 10^-2 for 50 steps. ... The buffer capacity is set to be n/κ = 32/2 = 16, that is, each expert can take 16 tokens at most. To match this setting, we use k = 16 in (18) for our sparsity-constrained router. ... Algorithms employing an OT-based approach perform 500 iterations to find T, using either the Sinkhorn algorithm (with the Negentropy method) or LBFGS (used by the rest of OT-based methods). We use a sparsity constraint of k = 1.15 m/n. (A hedged sketch of this setup appears below the table.) |
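
The experiment-setup row is the only place the solver is pinned down, so a short sketch may help readers see how the quoted pieces fit together: dual ascent with Adam (learning rate 10^-2, 50 steps) and a column-wise top-k truncation that enforces the sparsity constraint (e.g., k = 16 tokens per expert). This is a minimal, hypothetical PyTorch rendering; the function name `sparse_ot_dual`, the regularization strength `gamma`, and the exact form of the dual objective are assumptions for illustration, not code released with the paper.

```python
import torch

def sparse_ot_dual(C, a, b, k, gamma=1.0, steps=50, lr=1e-2):
    """Hedged sketch of sparsity-constrained OT via its smoothed dual.

    C: (n, m) cost matrix; a: (n,) source marginal; b: (m,) target marginal.
    Each column of the returned plan has at most k nonzeros. The Adam
    settings (lr=1e-2, 50 steps) mirror the paper's quoted setup; the
    objective below is an assumed quadratically regularized dual.
    """
    n, m = C.shape
    alpha = torch.zeros(n, requires_grad=True)
    beta = torch.zeros(m, requires_grad=True)
    opt = torch.optim.Adam([alpha, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        scores = alpha[:, None] + beta[None, :] - C          # dual slacks, (n, m)
        topk = torch.topk(torch.relu(scores), k, dim=0).values
        # assumed concave dual: <alpha, a> + <beta, b> - (1/2γ) Σ_j ||top-k_j||²
        dual = alpha @ a + beta @ b - topk.pow(2).sum() / (2 * gamma)
        (-dual).backward()                                   # Adam minimizes, so negate
        opt.step()
    # primal recovery: keep only the k largest positive slacks per column
    scores = (alpha[:, None] + beta[None, :] - C).detach()
    plan = torch.relu(scores) / gamma
    mask = torch.zeros_like(plan)
    mask.scatter_(0, torch.topk(plan, k, dim=0).indices, 1.0)
    return plan * mask
```

A usage example under the same assumptions, with made-up shapes (8 tokens routed to 4 experts, at most 2 tokens per expert); note that with only 50 Adam steps the marginals are satisfied approximately, not exactly:

```python
torch.manual_seed(0)
C = torch.rand(8, 4)              # cost between 8 tokens and 4 experts
a = torch.full((8,), 1 / 8)       # uniform source marginal
b = torch.full((4,), 1 / 4)       # uniform target marginal
T = sparse_ot_dual(C, a, b, k=2)
print((T > 0).sum(dim=0))         # per-column support sizes, all <= 2
```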