Unveiling the Dynamics of Information Interplay in Supervised Learning
Authors: Kun Song, Zhiquan Tan, Bochao Zou, Huimin Ma, Weiran Huang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, which is an intriguing phenomenon observed in supervised training, where the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method's effectiveness, demonstrating that the utilization of MIR and HDR not only aids in comprehending the dynamics throughout the training process but can also enhance the training procedure itself. (An illustrative sketch of such an information-based auxiliary loss appears after this table.) |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology Beijing; ²Department of Mathematical Sciences, Tsinghua University; ³MIFA Lab, Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; ⁴Shanghai AI Laboratory. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is open-source or publicly available. |
| Open Datasets | Yes | Our models are trained on CIFAR-10 and CIFAR-100. [...] We conduct experiments on CIFAR-10 and CIFAR-100. [...] Following Nanda et al. (2022); Tan & Huang (2023), we train a transformer to learn modular addition c ≡ (a + b) (mod p), with p being 113. [...] TorchSSL (Zhang et al., 2021), a sophisticated codebase encompassing a wide array of semi-supervised learning techniques as well as supervised learning implementations, was employed as our foundational code base. This enables us to implement our algorithm effectively and assess its performance on well-established datasets like CIFAR-10, CIFAR-100, and STL-10. (A data-generation sketch for the modular-addition task appears after this table.) |
| Dataset Splits | No | The paper mentions using a 'test set' and a 70% split for testing in one specific experiment, but it does not provide explicit training/validation/test splits (e.g., percentages or sample counts) for the main datasets (CIFAR-10, CIFAR-100, STL-10) or mention a validation set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Torch SSL' but does not provide specific version numbers for it or any other key software dependencies required for replication. |
| Experiment Setup | Yes | The default experimental configuration comprises training the models with an SGD optimizer (momentum of 0.9, weight decay of 5e-4), an initial learning rate of 0.03 with cosine annealing, a batch size of 64, and a total of 2^20 training iterations. The backbone architecture is WideResNet-28-2 for CIFAR-10 and WideResNet-28-8 for CIFAR-100. [...] We train the model using full-batch gradient descent, a learning rate of 0.001, and an AdamW optimizer with a weight decay parameter of 1. [...] We use an SGD optimizer, configured with a momentum of 0.9 and a weight decay parameter of 5e-4. The learning rate was initially set at 0.03, subject to cosine annealing. [...] The batch size is maintained at 64 across a comprehensive 1,048,000-iteration training regimen. (An optimizer/scheduler sketch matching these settings appears after this table.) |
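The paper introduces MIR and HDR as additional loss terms but, as noted above, does not release code. Below is a minimal, hypothetical PyTorch sketch of a matrix-based information regularizer added to a cross-entropy objective. It assumes the von Neumann matrix entropy and Hadamard-product matrix mutual information from the matrix information theory literature the paper builds on (Tan & Huang, 2023); the function names, the choice of inputs (sample features and classification-head outputs), and the weight `lam` are illustrative placeholders, not the authors' implementation of MIR or HDR.

```python
import torch
import torch.nn.functional as F

def _gram(z: torch.Tensor) -> torch.Tensor:
    """Gram matrix of L2-normalized features, scaled so its trace equals 1."""
    z = F.normalize(z, dim=1)          # (batch, dim)
    return z @ z.t() / z.shape[0]      # (batch, batch), trace = 1

def _von_neumann_entropy(k: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """H(K) = -tr(K log K) for a positive semi-definite matrix, renormalized to unit trace."""
    k = k / k.diagonal().sum().clamp_min(eps)
    eig = torch.linalg.eigvalsh(k).clamp_min(eps)
    return -(eig * eig.log()).sum()

def matrix_mutual_information(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """I(Z1; Z2) = H(K1) + H(K2) - H(K1 * K2), with * the elementwise (Hadamard) product."""
    k1, k2 = _gram(z1), _gram(z2)
    return _von_neumann_entropy(k1) + _von_neumann_entropy(k2) - _von_neumann_entropy(k1 * k2)

def total_loss(logits, labels, features, head_outputs, lam: float = 0.1):
    """Cross-entropy plus an information-interaction term (illustrative sign and weighting)."""
    ce = F.cross_entropy(logits, labels)
    info = matrix_mutual_information(features, head_outputs)
    return ce - lam * info   # reward higher mutual information between features and head outputs
```

The sign and weighting of the information term would need to follow the paper's own definitions of MIR and HDR; this sketch only illustrates how such a regularizer can be attached to a standard supervised objective.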
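For the grokking experiments, the setup is a transformer trained on modular addition c ≡ (a + b) (mod p) with p = 113. A minimal data-generation sketch follows; the 30%/70% train/test split matches the 70%-for-testing split mentioned in the Dataset Splits row, but the random seed, tensor layout, and function name are assumptions rather than the paper's released configuration.

```python
import torch

def modular_addition_dataset(p: int = 113, train_fraction: float = 0.3, seed: int = 0):
    """Enumerate all (a, b) pairs with label c = (a + b) mod p and split them at random."""
    a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
    pairs = torch.stack([a.flatten(), b.flatten()], dim=1)   # (p*p, 2) input token pairs
    labels = (pairs[:, 0] + pairs[:, 1]) % p                 # (p*p,) targets in [0, p)

    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(p * p, generator=generator)
    n_train = int(train_fraction * p * p)
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (pairs[train_idx], labels[train_idx]), (pairs[test_idx], labels[test_idx])

# 113 * 113 = 12,769 pairs in total; 30% for training, the remaining 70% held out for testing.
(train_x, train_y), (test_x, test_y) = modular_addition_dataset()
```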
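The hyperparameters quoted in the Experiment Setup row map directly onto standard PyTorch components. The sketch below wires them together (SGD, momentum 0.9, weight decay 5e-4, initial learning rate 0.03, cosine annealing, batch size 64); the `build_training` name, the 2^20 annealing horizon, and the data-loader options are assumptions, since the paper describes its setup in prose only.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader

TOTAL_ITERS = 2 ** 20   # total training iterations reported for the semi-supervised runs (assumed 2^20)

def build_training(model: torch.nn.Module, train_set):
    """SGD + cosine annealing with the hyperparameters quoted from the paper."""
    optimizer = SGD(model.parameters(), lr=0.03, momentum=0.9, weight_decay=5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_ITERS)  # decay the LR over the full run
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4, drop_last=True)
    return optimizer, scheduler, loader
```

The backbone described in the setup row (WideResNet-28-2 for CIFAR-10, WideResNet-28-8 for CIFAR-100) would be passed in as `model`.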