Overparameterisation and worst-case generalisation: friend or foe?

Authors: Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically verify that with such post-hoc correction, overparameterisation can improve average and worst-case performance." and "Table 1 summarises the test set results on all datasets."
Researcher Affiliation | Industry | Aditya Krishna Menon, Ankit Singh Rawat & Sanjiv Kumar, Google Research, New York, NY. {adityakmenon,ankitsrawat,sanjivk}@google.com
Pseudocode | No | The paper describes the correction procedures in prose (Sections 4.1 and 4.2) but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide a link to a code repository.
Open Datasets | Yes | "In the sequel, we shall make extensive use of three datasets from Sagawa et al. (2020a;b), each of which involves binary labels y ∈ Y and a binary attribute a(x) ∈ A: (i) synth, a synthetic dataset where X = R^200, Y = {±1}, and A = {±1}; (ii) waterbirds, a dataset of bird images with Y = {land bird, water bird} corresponding to the bird type, and A = {land background, water background} corresponding to the background; (iii) CelebA, a dataset of celebrity images with Y = {blond, dark} corresponding to individuals' hair colour, and A = {male, female}." (Citing Sagawa et al. (2020a;b) for the datasets.)
Dataset Splits | Yes | "We measure both the average and worst-subgroup errors on both the train and test set, repeating each experiment 5 times." and "We apply post-hoc correction to these learned models, via classifier retraining (CRT) on the learned representations, using a linear logistic regression model with subsampling of the dominant subgroups per Sagawa et al. (2020b); and threshold correction (THR) on the decision scores, using a holdout set to estimate thresholds {t_a : a ∈ {±1}} that minimise the worst-subgroup error. For waterbirds, we use the holdout set from Sagawa et al. (2020a); for CelebA, we use the standard holdout set; and for synth, we construct a holdout set using 20% of the training samples."
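
For concreteness, the sketch below renders the two post-hoc corrections quoted above in Python, against NumPy and scikit-learn (the "Logistic Regression package in sklearn" the paper itself names). The paper specifies both procedures only in prose, so the function names, the threshold grid search, and the subsample-to-smallest-subgroup rule are our own illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def crt_retrain(reps, labels, groups, seed=0):
    """Classifier retraining (CRT): refit a linear head on frozen
    representations after subsampling each (label, attribute) subgroup
    down to the size of the smallest one (our reading of the
    subsampling scheme of Sagawa et al. (2020b))."""
    rng = np.random.default_rng(seed)
    subgroups = [np.flatnonzero((labels == y) & (groups == a))
                 for y in (-1, +1) for a in (-1, +1)]
    n_min = min(len(ix) for ix in subgroups)
    keep = np.concatenate([rng.choice(ix, n_min, replace=False)
                           for ix in subgroups])
    return LogisticRegression(max_iter=1000).fit(reps[keep], labels[keep])

def fit_group_thresholds(scores, labels, groups, grid_size=101):
    """Threshold correction (THR): grid-search per-attribute thresholds
    {t_a : a in {-1, +1}} on a holdout set so as to minimise the
    worst-subgroup error of predicting sign(score - t_{a(x)})."""
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    best_ts, best_err = None, np.inf
    for t_neg in grid:
        for t_pos in grid:
            # Group-specific threshold for every holdout point.
            thresh = np.where(groups == +1, t_pos, t_neg)
            preds = np.where(scores > thresh, +1, -1)
            # Error on the worst of the four (label, attribute) subgroups.
            errs = [np.mean(preds[m] != labels[m])
                    for y in (-1, +1) for a in (-1, +1)
                    if (m := (labels == y) & (groups == a)).any()]
            if max(errs) < best_err:
                best_ts, best_err = {-1: t_neg, +1: t_pos}, max(errs)
    return best_ts, best_err
```

Per the splits described above, the holdout arguments would come from the Sagawa et al. (2020a) holdout set for waterbirds, the standard holdout set for CelebA, and a 20% split of the training samples for synth.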
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments; it only mentions training "models".
Software Dependencies | No | The paper mentions the "Logistic Regression package in sklearn" but does not specify the sklearn version, nor any other software dependencies such as the TensorFlow/PyTorch version used for the ResNet-50 experiments.
Experiment Setup | Yes | "For the ResNet-50 experiments... We train the models using SGD with a momentum value of 0.9. We use a batch size of 128, weight decay 10^-4, and a learning rate decayed according to a cosine schedule. We train with a base learning rate of 10^-4 for 1000 epochs on waterbirds, and a base learning rate of 10^-2 for 50 epochs on CelebA."
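
As a concrete rendering of these hyperparameters, the following minimal sketch configures the reported optimiser and schedule in PyTorch. The framework is our assumption (the paper does not name one, per the Software Dependencies row above), and train_loader is a hypothetical DataLoader yielding batches of 128 labelled images.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

# Reported hyperparameters: SGD with momentum 0.9, batch size 128,
# weight decay 1e-4, and a cosine-decayed learning rate.
model = resnet50(num_classes=2)
base_lr, epochs = 1e-4, 1000  # waterbirds; CelebA uses 1e-2 and 50 epochs

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    for x, y in train_loader:  # hypothetical DataLoader, batch_size=128
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
    scheduler.step()  # one cosine step per epoch
```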