Self-supervised Representation Learning with Relative Predictive Coding

Authors: Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, Ruslan Salakhutdinov

ICLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces the relative parameters to regularize the objective for boundedness and low variance. Second, RPC contains no logarithm and exponential score functions, which are the main cause of training instability in prior contrastive objectives. We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks. Lastly, we relate RPC with mutual information (MI) estimation, showing RPC can be used to estimate MI with low variance.
Researcher Affiliation Collaboration Yao-Hung Hubert Tsai1, Martin Q. Ma1, Muqiao Yang1, Han Zhao2,3, Louis-Philippe Morency1, Ruslan Salakhutdinov1 1Carnegie Mellon University, 2D.E. Shaw & Co., 3University of Illinois at Urbana-Champaign
Pseudocode No The paper provides mathematical formulations and descriptions of its method but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Project page: https://github.com/martinmamql/relative_predictive_coding
Open Datasets Yes Datasets. For visual object classification, we consider CIFAR-10/-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), and ImageNet (Russakovsky et al., 2015); for speech recognition, we use LibriSpeech (Panayotov et al., 2015).
Dataset Splits No While the paper mentions 'fine-tuning and evaluation' and the use of 'test split', it does not explicitly state specific proportions or details for a validation split (e.g., '80/10/10 split', or how validation sets were used for hyperparameter tuning).
Hardware Specification Yes ImageNet Following the settings in (Chen et al., 2020b;c), we train the model on Cloud TPU with 128 cores... CIFAR-10/-100... train the model on a single GPU... We would also like to acknowledge NVIDIA's GPU support and Cloud TPU support from Google's TensorFlow Research Cloud (TFRC).
Software Dependencies No The paper mentions specific optimizers (LARS, Adam) and training on Cloud TPUs via Google's TensorFlow Research Cloud, but it does not provide version numbers for any software dependencies (e.g., 'PyTorch 1.9', 'TensorFlow 2.x').
Experiment Setup Yes ImageNet Following the settings in (Chen et al., 2020b;c), we train the model on Cloud TPU with 128 cores, with a batch size of 4,096... We use the LARS optimizer (You et al., 2017) with momentum 0.9. The learning rate linearly increases for the first 20 epochs, reaching a maximum of 6.4, then is decayed with a cosine decay schedule. The weight decay is 10^-4. We train the model for only 100 epochs... For JRPC we disable hidden normalization and use a temperature τ = 32. For all other objectives, we use hidden normalization and τ = 0.1... For relative parameters, we use α = 0.3, β = 0.001, γ = 0.1 and α = 0.3, β = 0.001, γ = 0.005 for ResNet-50 and ResNet-152 respectively.
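As a rough illustration of the objective summarized above, the following sketch estimates an RPC-style loss from an in-batch score matrix. This is an assumption-laden reading of the paper's description (relative parameters α, β, γ bounding and regularizing the objective, no log or exp score functions), not code from the authors' repository: the batch estimator, the convention that diagonal entries are positive pairs, and the `rpc_loss` name are all illustrative; the default α, β, γ follow the ResNet-152 settings quoted in the row above.

```python
import numpy as np

def rpc_loss(scores, alpha=0.3, beta=0.001, gamma=0.005):
    """Hypothetical batch estimator of the RPC objective, negated for minimization.

    scores[i, j] is a critic score f(x_i, y_j): diagonal entries play the role
    of positive pairs (samples from the joint P), off-diagonal entries
    approximate samples from the product of marginals Q.
    J_RPC = E_P[f] - alpha*E_Q[f] - (beta/2)*E_P[f^2] - (gamma/2)*E_Q[f^2]
    The squared terms (scaled by beta and gamma) keep the objective bounded,
    and no log/exp appears anywhere, matching the stability argument above.
    """
    n = scores.shape[0]
    pos = np.diag(scores)                      # positive-pair scores
    neg = scores[~np.eye(n, dtype=bool)]       # negative-pair scores
    j_rpc = (pos.mean()
             - alpha * neg.mean()
             - 0.5 * beta * (pos ** 2).mean()
             - 0.5 * gamma * (neg ** 2).mean())
    return -j_rpc  # minimize the negative of the objective
```

Under this sketch, a score matrix that separates positives from negatives (e.g., large diagonal, small off-diagonal) yields a lower loss than an uninformative one, and the quadratic penalties prevent the loss from diverging as scores grow.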