Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup
Authors: Muthu Chidambaram, Xiang Wang, Chenwei Wu, Rong Ge
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show in Section 5 that our theory extends to practice by training models on image classification benchmarks that are modified to have additional spurious features correlated with the true class labels. We find in our experiments that Midpoint Mixup outperforms ERM, and performs comparably to the previously used Mixup settings in Zhang et al. (2018). (Section 5: Experiments; a sketch of such a spurious-feature modification is given after this table.) |
| Researcher Affiliation | Academia | Muthu Chidambaram 1 Xiang Wang 1 Chenwei Wu 1 Rong Ge 1 1Department of Computer Science, Duke University. Correspondence to: Muthu Chidambaram <muthu@cs.duke.edu>. |
| Pseudocode | No | We denote our network by g : ℝ^{P×d} → ℝ^k. For each y ∈ [k], we define g_y as g_y(X) = Σ_r Σ_{p∈[P]} ReLU(⟨w_{y,r}, x^{(p)}⟩) (4.1). We will use w^{(0)}_{y,r} to refer to the weights of the network g at initialization (and w^{(t)}_{y,r} after t steps of gradient descent), and similarly g_t to refer to the model after t iterations of gradient descent. We consider the standard choice of Xavier initialization, which, in our setting, corresponds to w^{(0)}_{y,r} ∼ N(0, …). For model training, we focus on full-batch gradient descent with a fixed learning rate of η applied to J(g, X) and J_MM(g, X). Once again using the notation ∇_{w^{(t)}_{y,r}} for the gradient with respect to w_{y,r} at step t, the updates to the weights of the network g are thus of the form: w^{(t+1)}_{y,r} = w^{(t)}_{y,r} − η ∇_{w^{(t)}_{y,r}} J_MM(g, X) (4.2). (See the PyTorch sketch after this table.) |
| Open Source Code | Yes | Code for our experiments is available at: https://github.com/2014mchidamb/ midpoint-mixup-multi-view-icml. |
| Open Datasets | Yes | For our experimental setup, we consider training ResNet-18 (He et al., 2015) on versions of Fashion MNIST (FMNIST) (Xiao et al., 2017), CIFAR-10, and CIFAR-100 (Krizhevsky, 2009) |
| Dataset Splits | No | All models were trained for 100 epochs with a batch size of 750, which was the largest feasible size on our compute setup of a single P100 GPU (we use a large batch size to approximate the full batch gradient descent aspect of our theory). |
| Hardware Specification | Yes | All models were trained for 100 epochs with a batch size of 750, which was the largest feasible size on our compute setup of a single P100 GPU (we use a large batch size to approximate the full batch gradient descent aspect of our theory). |
| Software Dependencies | No | Our implementation is in PyTorch (Paszke et al., 2019) and uses the ResNet implementation of Kuang Liu, released under an MIT license. All models were trained for 100 epochs with a batch size of 750... For optimization, we use Adam (Kingma & Ba, 2015) with the default hyperparameters of β1 = 0.9, β2 = 0.999 and a learning rate of 0.001. |
| Experiment Setup | Yes | All models were trained for 100 epochs with a batch size of 750... For optimization, we use Adam (Kingma & Ba, 2015) with the default hyperparameters of β1 = 0.9, β2 = 0.999 and a learning rate of 0.001. (See the training-loop sketch after this table.) |
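
The Research Type row above describes benchmarks modified to carry additional spurious features correlated with the true class labels. A minimal sketch of one way such a modification could look is below; the patch size, position, and per-class colors are illustrative assumptions, not the paper's exact construction.

```python
import torch
from torchvision import datasets, transforms

class SpuriousCIFAR10(torch.utils.data.Dataset):
    """CIFAR-10 with an extra class-correlated patch painted onto each image.

    Illustrative only: the patch size, its location, and the per-class colors
    are assumptions, not the paper's exact modification.
    """

    def __init__(self, root, train=True, patch_size=4):
        self.base = datasets.CIFAR10(
            root, train=train, download=True,
            transform=transforms.ToTensor())
        self.patch_size = patch_size
        # One fixed color per class, so the patch is perfectly correlated with the label.
        torch.manual_seed(0)
        self.colors = torch.rand(10, 3)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]          # img: (3, 32, 32) tensor in [0, 1]
        p = self.patch_size
        # Paint a class-colored patch in the top-left corner.
        img[:, :p, :p] = self.colors[label].view(3, 1, 1)
        return img, label
```

Because the patch is a deterministic function of the label, it is a spurious feature that is fully correlated with the class, which matches the kind of modification the quoted experiments describe.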
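The Pseudocode row quotes the paper's one-hidden-layer ReLU network (Eq. 4.1) and its full-batch gradient descent update under the Midpoint Mixup objective J_MM (Eq. 4.2). Below is a hedged PyTorch sketch of that setup; the dimensions, the random-permutation pairing inside the loss, and the soft-label cross-entropy form of J_MM are illustrative assumptions rather than the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (not the paper's): k classes, P patches of dimension d,
# m ReLU neurons per class, n examples.
k, P, d, m, n = 10, 4, 32, 8, 64
W = torch.nn.Parameter(torch.randn(k, m, d) / d ** 0.5)  # Xavier-style initialization

def g(X, W):
    """g_y(X) = sum over neurons r and patches p of ReLU(<w_{y,r}, x^{(p)}>)."""
    pre = torch.einsum('bpd,kmd->bkmp', X, W)   # inner products per class/neuron/patch
    return F.relu(pre).sum(dim=(2, 3))          # (batch, k) logits

def midpoint_mixup_loss(X, y, W):
    """Cross-entropy on midpoints of example pairs (mixing coefficient fixed at 1/2).

    Pairing via a random permutation is an assumption; the paper's J_MM may be
    defined over a different set of pairs.
    """
    perm = torch.randperm(X.shape[0])
    X_mix = 0.5 * (X + X[perm])
    y_soft = 0.5 * (F.one_hot(y, k).float() + F.one_hot(y[perm], k).float())
    return -(y_soft * F.log_softmax(g(X_mix, W), dim=1)).sum(dim=1).mean()

# One full-batch gradient descent step in the spirit of Eq. (4.2).
X, y = torch.randn(n, P, d), torch.randint(0, k, (n,))
eta = 0.1                                        # illustrative step size
grad, = torch.autograd.grad(midpoint_mixup_loss(X, y, W), W)
with torch.no_grad():
    W -= eta * grad
```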
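The Software Dependencies and Experiment Setup rows report ResNet-18 trained with Adam (learning rate 0.001, β1 = 0.9, β2 = 0.999), batch size 750, for 100 epochs. A rough training-loop sketch consistent with those values follows; it uses torchvision's resnet18 as a stand-in for the Kuang Liu CIFAR implementation cited in the paper and plain CIFAR-10 in place of the modified benchmarks, so it should be read as an approximation rather than the authors' code.

```python
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Hyperparameters taken from the quoted setup (100 epochs, batch size 750,
# Adam with lr 1e-3 and betas (0.9, 0.999)). The dataset here is plain CIFAR-10;
# the paper trains on spurious-feature-modified versions of FMNIST/CIFAR-10/CIFAR-100
# (e.g. something like the SpuriousCIFAR10 sketch above).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = resnet18(num_classes=10).to(device)  # stand-in for the Kuang Liu CIFAR ResNet-18
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

train_set = datasets.CIFAR10('./data', train=True, download=True,
                             transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=750, shuffle=True)

for epoch in range(100):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        # Midpoint Mixup: average each example with a shuffled partner (lambda fixed at 1/2).
        perm = torch.randperm(images.size(0), device=device)
        mixed = 0.5 * (images + images[perm])
        log_probs = F.log_softmax(model(mixed), dim=1)
        # Cross-entropy against the 1/2-1/2 mixture of the two labels in each pair.
        loss = 0.5 * (F.nll_loss(log_probs, labels) + F.nll_loss(log_probs, labels[perm]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```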