Self-supervised Representation Learning with Relative Predictive Coding
Authors: Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, Ruslan Salakhutdinov
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces the relative parameters to regularize the objective for boundedness and low variance. Second, RPC contains no logarithm and exponential score functions, which are the main cause of training instability in prior contrastive objectives. We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks. Lastly, we relate RPC with mutual information (MI) estimation, showing RPC can be used to estimate MI with low variance. |
| Researcher Affiliation | Collaboration | Yao-Hung Hubert Tsai1, Martin Q. Ma1, Muqiao Yang1, Han Zhao2,3, Louis-Philippe Morency1, Ruslan Salakhutdinov1 — 1Carnegie Mellon University, 2D.E. Shaw & Co., 3University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of its method but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Project page: https://github.com/martinmamql/relative_predictive_coding |
| Open Datasets | Yes | Datasets. For visual object classification, we consider CIFAR-10/-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), and ImageNet (Russakovsky et al., 2015), and speech recognition on LibriSpeech (Panayotov et al., 2015). |
| Dataset Splits | No | While the paper mentions 'fine-tuning and evaluation' and the use of 'test split', it does not explicitly state specific proportions or details for a validation split (e.g., '80/10/10 split', or how validation sets were used for hyperparameter tuning). |
| Hardware Specification | Yes | ImageNet: Following the settings in (Chen et al., 2020b;c), we train the model on Cloud TPU with 128 cores... CIFAR-10/-100... train the model on a single GPU... We would also like to acknowledge NVIDIA's GPU support and Cloud TPU support from Google's TensorFlow Research Cloud (TFRC). |
| Software Dependencies | No | The paper mentions using specific optimizers (LARS, Adam) and frameworks like TensorFlow Research Cloud, but it does not provide specific version numbers for any software dependencies (e.g., 'PyTorch 1.9', 'TensorFlow 2.x'). |
| Experiment Setup | Yes | ImageNet: Following the settings in (Chen et al., 2020b;c), we train the model on Cloud TPU with 128 cores, with a batch size of 4,096... We use the LARS optimizer (You et al., 2017) with momentum 0.9. The learning rate linearly increases for the first 20 epochs, reaching a maximum of 6.4, then is decayed with a cosine decay schedule. The weight decay is 10^-4. We train the model for only 100 epochs... For JRPC we disable hidden normalization and use a temperature τ = 32. For all other objectives, we use hidden normalization and τ = 0.1... For relative parameters, we use α = 0.3, β = 0.001, γ = 0.1 and α = 0.3, β = 0.001, γ = 0.005 for ResNet-50 and ResNet-152 respectively. |
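The RPC objective summarized in the abstract row (no logarithm or exponential, with relative parameters α, β, γ regularizing for boundedness and low variance) can be sketched as a minibatch estimator. This is an illustrative sketch, not the authors' code: it assumes J_RPC takes the form of the positive-pair score expectation, minus α times the negative-pair expectation, minus β/2 and γ/2 times the squared-score expectations over positives and negatives respectively, with positives on the diagonal of a score matrix; the function name and matrix convention are assumptions.

```python
import numpy as np

def rpc_loss(scores, alpha=0.3, beta=0.001, gamma=0.1):
    """Minibatch estimate of the negated RPC objective (a sketch).

    scores[i, j] = f(x_i, y_j): diagonal entries are positive-pair scores
    (samples from the joint); off-diagonal entries serve as negative-pair
    scores (approximating samples from the product of marginals).
    """
    n = scores.shape[0]
    pos = np.diag(scores)                        # ~ E_{P_XY}[f]
    neg = scores[~np.eye(n, dtype=bool)]         # ~ E_{P_X P_Y}[f]
    j_rpc = (pos.mean()
             - alpha * neg.mean()                # relative weight on negatives
             - 0.5 * beta * (pos ** 2).mean()    # bounds/variance: positives
             - 0.5 * gamma * (neg ** 2).mean())  # bounds/variance: negatives
    return -j_rpc  # maximize J_RPC by minimizing its negation
```

Because every term is a plain polynomial in the scores, the estimator stays bounded and avoids the log/exp terms that the review row identifies as the source of instability in prior contrastive objectives.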
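The ImageNet learning-rate schedule quoted in the setup row (linear warmup to 6.4 over the first 20 epochs, then cosine decay, 100 epochs total) can be sketched as below. The helper name and the choice to decay all the way to zero are assumptions; the paper does not state the decay endpoint.

```python
import math

def lr_at_epoch(epoch, peak=6.4, warmup=20, total=100):
    """Linear warmup to `peak` over `warmup` epochs, then cosine decay
    to zero over the remaining `total - warmup` epochs (a sketch)."""
    if epoch < warmup:
        # Linear ramp: reaches `peak` at the last warmup epoch.
        return peak * (epoch + 1) / warmup
    # Cosine decay: progress goes 0 -> 1 over the post-warmup epochs.
    progress = (epoch - warmup) / (total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule sits at its 6.4 maximum at the warmup boundary and at half the peak midway through the decay phase.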