Decomposed Mutual Information Estimation for Contrastive Representation Learning

Authors: Alessandro Sordoni, Nouha Dziri, Hannes Schulz, Geoff Gordon, Philip Bachman, Remi Tachet Des Combes

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and learns better representations in a vision domain and for dialogue generation. Finally, we present evidence of the effectiveness of the proposed method in vision and in dialogue generation. We verify (1) in a synthetic experiment where we control the total amount of MI between Gaussian covariates. Then, we verify (2) on a self-supervised image representation learning domain and explore an additional application to natural language generation in a sequential setting: conversational dialogue. Table 1 reports the average accuracy of linear evaluations obtained over 3 pretraining seeds. (A hedged sketch of the controlled-MI Gaussian setup follows the table.)
Researcher Affiliation | Collaboration | 1 Microsoft Research, 2 University of Alberta. Correspondence to: Alessandro Sordoni <alsordon@microsoft.com>, Nouha Dziri <dziri@cs.ualberta.ca>.
Pseudocode | Yes | Listing 1: PyTorch-style pseudo-code for DEMI in InfoMin. We use IBO to estimate the critic for conditional MI. (A hedged contrastive-critic sketch follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We study self-supervised learning of image representations using 224x224 images from ImageNet (Deng et al., 2009). We report transfer learning performance by freezing the encoder on STL-10, CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), Stanford Cars (Krause et al., 2013), Caltech-UCSD Birds (CUB) (Welinder et al., 2010), and Oxford 102 Flowers (Nilsback & Zisserman, 2008). We experiment with a language modeling task on the Wizard of Wikipedia (WoW) dataset (Dinan et al., 2019). (A sketch of the frozen-encoder linear evaluation protocol follows the table.)
Dataset Splits | No | The paper mentions evaluating on a 'validation set' and states 'All hyperparameters for training and evaluation are the same as in Tian et al. (2020)', implying standard splits for well-known datasets. However, it does not explicitly state training/validation/test percentages or example counts, nor does it spell out the split methodology used by Tian et al. (2020).
Hardware Specification | No | The paper does not specify GPU models, CPU models, or any other hardware details used to run the experiments; it only mentions architectural and training choices such as the ResNet-50 backbone.
Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code' and the use of GPT-2, but does not provide version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | All models use a momentum-contrastive memory buffer of K = 65536 examples (Chen et al., 2020b). All models use a ResNet-50 backbone and are trained for 200 epochs. We train with learning rate 0.5, batch size 800, a momentum coefficient of 0.9, and a cosine annealing schedule. Our energy function is the cosine similarity between representations scaled by a temperature of 0.5 (Chen et al., 2020b). (A minimal optimizer and energy-function sketch follows the table.)
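
The Research Type row cites a synthetic experiment that controls the total amount of MI between Gaussian covariates. The paper does not give the sampling code; below is a minimal sketch, assuming per-dimension correlated Gaussians, where `target_mi` and `sample_correlated_gaussians` are illustrative names. For jointly Gaussian pairs with correlation rho per dimension, I(x; y) = -0.5 * dim * log(1 - rho^2), so rho can be chosen to hit a target MI exactly.

```python
# Hypothetical sketch of a controlled-MI Gaussian setting; not the paper's code.
import torch

def sample_correlated_gaussians(batch_size, dim, target_mi):
    """Sample (x, y) pairs whose ground-truth MI equals target_mi (in nats).

    For jointly Gaussian variables with per-dimension correlation rho,
    I(x; y) = -0.5 * dim * log(1 - rho**2); we invert that formula for rho.
    """
    rho = (1.0 - torch.exp(torch.tensor(-2.0 * target_mi / dim))).sqrt()
    x = torch.randn(batch_size, dim)
    eps = torch.randn(batch_size, dim)
    y = rho * x + (1.0 - rho ** 2).sqrt() * eps
    return x, y

# Example: 20-dimensional covariates with 10 nats of total MI.
x, y = sample_correlated_gaussians(batch_size=128, dim=20, target_mi=10.0)
```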
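
The Pseudocode row refers to the paper's Listing 1 (PyTorch-style pseudo-code for DEMI in InfoMin), which is not reproduced here. The sketch below only illustrates the decomposition idea under simplifying assumptions: the representation is split into two chunks, the first chunk is scored with a standard contrastive (InfoNCE-style) critic, and the second chunk's critic also conditions on the first. The function and head names (`demi_loss`, `f_uncond`, `f_cond`) are illustrative, not the authors'.

```python
# Hedged sketch of a decomposed contrastive objective; not the paper's Listing 1.
import torch
import torch.nn.functional as F

def info_nce(scores):
    """scores[i, j]: critic value between query i and key j; positives on the diagonal."""
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

def demi_loss(z_x, z_y, f_uncond, f_cond, tau=0.5):
    """z_x, z_y: paired view representations of shape [B, D]; f_* are projection heads."""
    x1, x2 = z_x.chunk(2, dim=-1)
    y1, y2 = z_y.chunk(2, dim=-1)
    # Unconditional term: contrastive bound on the MI carried by the first chunk.
    q1 = F.normalize(f_uncond(x1), dim=-1)
    k1 = F.normalize(f_uncond(y1), dim=-1)
    loss_uncond = info_nce(q1 @ k1.t() / tau)
    # Conditional term: the critic for the second chunk also sees the first chunk x1.
    q2 = F.normalize(f_cond(torch.cat([x2, x1], dim=-1)), dim=-1)
    k2 = F.normalize(f_cond(torch.cat([y2, x1], dim=-1)), dim=-1)
    loss_cond = info_nce(q2 @ k2.t() / tau)
    return loss_uncond + loss_cond

# Example usage with toy projection heads (dimensions are arbitrary):
# f_uncond = torch.nn.Linear(64, 128); f_cond = torch.nn.Linear(128, 128)
# loss = demi_loss(torch.randn(32, 128), torch.randn(32, 128), f_uncond, f_cond)
```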
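
The Open Datasets row mentions transfer evaluation by freezing the pretrained encoder. The paper does not include evaluation code; the following is a generic sketch of the standard frozen-encoder linear evaluation protocol, with the encoder, loader, and optimizer settings assumed rather than taken from the paper.

```python
# Generic frozen-encoder linear evaluation sketch; hyperparameters are illustrative.
import torch
import torch.nn as nn

def linear_eval(encoder, train_loader, feat_dim, num_classes, epochs=90, device="cuda"):
    encoder.eval()                          # freeze: no BN updates, no dropout
    for p in encoder.parameters():
        p.requires_grad_(False)
    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)     # frozen features from the pretrained backbone
            loss = ce(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```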
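
The Experiment Setup row reports the optimization settings (learning rate 0.5, momentum 0.9, cosine annealing over 200 epochs) and a temperature-scaled cosine-similarity energy. A minimal sketch of just those pieces follows; the projection dimension, the ResNet-50 head, and the momentum-contrastive memory buffer are assumptions or omitted.

```python
# Minimal sketch of the reported optimizer and energy function; not the full training loop.
import torch
import torch.nn.functional as F
import torchvision

def energy(q, k, temperature=0.5):
    """Cosine similarity between representations, scaled by the temperature."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    return q @ k.t() / temperature

encoder = torchvision.models.resnet50(num_classes=128)   # projection dim 128 is assumed
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)  # 200 epochs
```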