Decomposed Mutual Information Estimation for Contrastive Representation Learning
Authors: Alessandro Sordoni, Nouha Dziri, Hannes Schulz, Geoff Gordon, Philip Bachman, Remi Tachet des Combes
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and learns better representations in a vision domain and for dialogue generation. We verify (1) in a synthetic experiment where we control the total amount of MI between Gaussian covariates (a sketch of such a controlled-MI setup follows the table). Then, we verify (2) on a self-supervised image representation learning domain and explore an additional application to natural language generation in a sequential setting: conversational dialogue. Table 1 reports the average accuracy of linear evaluations obtained by 3 pretraining seeds. |
| Researcher Affiliation | Collaboration | ¹Microsoft Research, ²University of Alberta. Correspondence to: Alessandro Sordoni <alsordon@microsoft.com>, Nouha Dziri <dziri@cs.ualberta.ca>. |
| Pseudocode | Yes | Listing 1: PyTorch-style pseudo-code for DEMI in InfoMin. We use IBO to estimate the critic for conditional MI. (A generic contrastive-critic sketch follows the table.) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We study self-supervised learning of image representations using 224×224 images from ImageNet (Deng et al., 2009). We report transfer learning performance by freezing the encoder on STL-10, CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), Stanford Cars (Krause et al., 2013), Caltech-UCSD Birds (CUB) (Welinder et al., 2010) and Oxford 102 Flowers (Nilsback & Zisserman, 2008). We experiment with a language modeling task on the Wizard of Wikipedia (WoW) dataset (Dinan et al., 2019). (A frozen-encoder linear-evaluation sketch follows the table.) |
| Dataset Splits | No | The paper mentions evaluating on a 'validation set' and states 'All hyperparameters for training and evaluation are the same as in Tian et al. (2020)', implying standard splits for well-known datasets. However, it does not explicitly provide the exact percentages or counts for training/validation/test splits, nor does it restate the specific split methodology inherited from Tian et al. (2020). |
| Hardware Specification | No | The paper does not specify the exact GPU models, CPU models, or any other detailed hardware specifications used for running the experiments. It only mentions general aspects like 'Resnet50 backbone' (an architecture) and training parameters. |
| Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code' and using 'GPT2' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | All models use a momentum-contrastive memory buffer of K = 65536 examples (Chen et al., 2020b). All models use a Resnet50 backbone and are trained for 200 epochs. We train with learning rate 0.5, batch-size 800, momentum coefficient of 0.9 and a cosine annealing schedule. Our energy function is the cosine similarity between representations scaled by a temperature of 0.5 (Chen et al., 2020b). (These values are collected into a configuration sketch after the table.) |
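The Research Type row above quotes the synthetic experiment that controls the total MI between Gaussian covariates. A minimal sketch of one way to build such a controlled-MI setup, assuming a per-dimension correlation `rho` (the construction and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def sample_correlated_gaussians(n, dim, rho, seed=0):
    """Draw (x, y) pairs whose matching coordinates have correlation rho.

    For this construction the ground-truth mutual information is known
    in closed form: I(x; y) = -dim/2 * log(1 - rho**2) nats.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    eps = rng.standard_normal((n, dim))
    y = rho * x + np.sqrt(1.0 - rho ** 2) * eps  # corr(x_i, y_i) = rho
    return x, y

def true_mi(dim, rho):
    return -dim / 2.0 * np.log(1.0 - rho ** 2)

x, y = sample_correlated_gaussians(n=4096, dim=20, rho=0.9)
print(f"ground-truth MI: {true_mi(20, 0.9):.2f} nats")
```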
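The Pseudocode row refers to the paper's Listing 1, which is not reproduced in this report. As a reference point for what a contrastive critic in that style computes, here is a minimal sketch of the standard InfoNCE bound that decomposed estimators such as DEMI build on (the function name and the batch-as-negatives convention are assumptions, not the paper's listing):

```python
import math
import torch
import torch.nn.functional as F

def info_nce_bound(z1, z2, temperature=0.5):
    """InfoNCE lower bound on I(z1; z2) for a batch of positive pairs.

    z1, z2: (batch, dim) representations of two views of the same inputs;
    off-diagonal pairs within the batch act as negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # temperature-scaled cosine scores
    labels = torch.arange(z1.size(0), device=z1.device)
    # Training minimizes this cross-entropy; the MI bound is log(batch) - CE.
    ce = F.cross_entropy(logits, labels)
    return math.log(z1.size(0)) - ce
```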
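The Open Datasets row mentions reporting transfer performance with a frozen encoder. A hedged sketch of that linear-evaluation protocol, assuming a PyTorch `DataLoader` yielding (image, label) batches and 2048-d ResNet-50 features with the classification head removed; the helper name is hypothetical:

```python
import torch
import torch.nn as nn

def extract_features(encoder, loader, device="cpu"):
    """Run a frozen encoder over a dataset to collect linear-eval features."""
    encoder.eval().to(device)
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(encoder(x.to(device)).cpu())
            labels.append(y)
    return torch.cat(feats), torch.cat(labels)

# Only this linear head is trained on the extracted features;
# the encoder weights stay frozen throughout.
classifier = nn.Linear(2048, 10)  # e.g. 2048-d features, 10 CIFAR-10 classes
```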
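Finally, the hyperparameters quoted in the Experiment Setup row map directly onto a standard PyTorch configuration. A sketch under stated assumptions (the paper names only the values; the SGD optimizer class and the bare ResNet-50 encoder call are assumptions):

```python
import torch
import torchvision

# Quoted values: lr 0.5, momentum 0.9, cosine annealing over 200 epochs,
# batch size 800, momentum-contrastive buffer of K = 65536 examples.
encoder = torchvision.models.resnet50()
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

def energy(h1, h2, temperature=0.5):
    # Cosine similarity between representations, scaled by temperature 0.5.
    return torch.nn.functional.cosine_similarity(h1, h2, dim=-1) / temperature
```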