On Mutual Information Maximization for Representation Learning

Authors: Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, Mario Lucic

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators.
Researcher Affiliation | Collaboration | Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic, Google Research, Brain Team; Paul K. Rubenstein is a PhD student at the University of Cambridge and the Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to Michael Tschannen (tschannen@google.com), Josip Djolonga (josipd@google.com), and Mario Lucic (lucic@google.com).
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for running the experiments and visualizing the results is available at https://github.com/google-research/google-research/tree/master/mutual_information_representation_learning.
Open Datasets | Yes | To this end, we consider a simple setup of learning a representation of the top half of MNIST handwritten digit images (we present results for the experiments from Sections 3.2 and 3.3 on CIFAR10 in Appendix G).
Dataset Splits | No | The paper mentions using the MNIST and CIFAR10 datasets but does not explicitly provide details about a validation split (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | We train g1, g2, and f using the Adam optimizer (Kingma and Ba, 2015), and use g1(x_top) as the representation for the linear evaluation. Unless stated otherwise, we use a bilinear critic f(x, y) = x^T W y (we investigate its effect in a separate ablation study), set the batch size to 128 and the learning rate to 10^-4.
Experiment Setup | Yes | Unless stated otherwise, we use a bilinear critic f(x, y) = x^T W y (we investigate its effect in a separate ablation study), set the batch size to 128 and the learning rate to 10^-4. Throughout, I_EST values and downstream classification accuracies are averaged over 20 runs and reported on the testing set.
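To make the setup quoted in the last two rows concrete, the following is a minimal sketch of MI maximization on MNIST half-images with a bilinear critic, Adam, batch size 128, and learning rate 10^-4. It assumes an InfoNCE-style estimator, small MLP encoders, and PyTorch; the encoder sizes, representation dimensionality, and all function and variable names are illustrative assumptions, not the authors' implementation (their released TensorFlow code, linked above, is the authoritative reference).

```python
# Minimal sketch (PyTorch), not the authors' code: two encoders g1, g2 for the
# top/bottom halves of an MNIST image, a bilinear critic f(x, y) = x^T W y, and
# an InfoNCE-style lower bound on the MI between the two half-image representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # representation dimensionality (assumed; not specified in the quoted text)

def make_encoder():
    # Small MLP over a flattened 14x28 half-image; the paper ablates encoder architectures.
    return nn.Sequential(nn.Flatten(), nn.Linear(14 * 28, 256), nn.ReLU(), nn.Linear(256, DIM))

g1, g2 = make_encoder(), make_encoder()
W = nn.Parameter(torch.randn(DIM, DIM) * 0.01)  # parameters of the bilinear critic

def infonce_lower_bound(x_top, x_bottom):
    """InfoNCE estimate of MI between g1(x_top) and g2(x_bottom)."""
    u, v = g1(x_top), g2(x_bottom)   # shape: (batch, DIM)
    scores = u @ W @ v.t()           # f(x_i, y_j) = u_i^T W v_j for all pairs in the batch
    # Positive pairs lie on the diagonal; other batch elements serve as negatives.
    batch_size = scores.size(0)
    return torch.diagonal(F.log_softmax(scores, dim=1)).mean() + torch.log(torch.tensor(float(batch_size)))

opt = torch.optim.Adam(list(g1.parameters()) + list(g2.parameters()) + [W], lr=1e-4)

def train_step(images):
    # images: (128, 1, 28, 28) MNIST batch, split into top and bottom halves.
    x_top, x_bottom = images[:, :, :14, :], images[:, :, 14:, :]
    loss = -infonce_lower_bound(x_top, x_bottom)  # maximize the MI lower bound
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

As in the quoted setup, the positive pair is formed by the two halves of the same image, the critic is bilinear, and optimization uses Adam with batch size 128 and learning rate 10^-4; downstream linear evaluation would then be run on the frozen representation g1(x_top).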