Overparameterisation and worst-case generalisation: friend or foe?

Authors: Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically verify that with such post-hoc correction, overparameterisation can improve average and worst-case performance." and "Table 1 summarises the test set results on all datasets."
Researcher Affiliation | Industry | Aditya Krishna Menon, Ankit Singh Rawat & Sanjiv Kumar, Google Research, New York, NY. {adityakmenon,ankitsrawat,sanjivk}@google.com
Pseudocode | No | The paper describes the correction procedures in prose (Sections 4.1 and 4.2) but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide a link to a code repository.
Open Datasets | Yes | "In the sequel, we shall make extensive use of three datasets from Sagawa et al. (2020a;b), each of which involves binary labels y ∈ Y and a binary attribute a(x) ∈ A: (i) synth, a synthetic dataset where X = R^200, Y = {±1}, and A = {±1}; (ii) waterbirds, a dataset of bird images with Y = {land bird, water bird} corresponding to the bird type, and A = {land background, water background} corresponding to the background; (iii) CelebA, a dataset of celebrity images with Y = {blond, dark} corresponding to individuals' hair colour, and A = {male, female}." (Citing Sagawa et al. (2020a;b) for the datasets.)
Dataset Splits | Yes | "We measure both the average and worst-subgroup errors on both the train and test set, repeating each experiment 5 times." and "We apply post-hoc correction to these learned models, via classifier retraining (CRT) on the learned representations, using a linear logistic regression model with subsampling of the dominant subgroups per Sagawa et al. (2020b); and threshold correction (THR) on the decision scores, using a holdout set to estimate thresholds {t_a : a ∈ {±1}} that minimise the worst-subgroup error. For waterbirds, we use the holdout set from Sagawa et al. (2020a); for CelebA, we use the standard holdout set; and for synth, we construct a holdout set using 20% of the training samples."
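
For concreteness, the sketch below renders the two post-hoc corrections quoted above in Python, against NumPy and scikit-learn (the "Logistic Regression package in sklearn" the paper itself names). The paper specifies both procedures only in prose, so the function names, the threshold grid search, and the subsample-to-smallest-subgroup rule are our own illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def crt_retrain(reps, labels, groups, seed=0):
    """Classifier retraining (CRT): refit a linear head on frozen
    representations after subsampling each (label, attribute) subgroup
    down to the size of the smallest one (our reading of the
    subsampling scheme of Sagawa et al. (2020b))."""
    rng = np.random.default_rng(seed)
    subgroups = [np.flatnonzero((labels == y) & (groups == a))
                 for y in (-1, +1) for a in (-1, +1)]
    n_min = min(len(ix) for ix in subgroups)
    keep = np.concatenate([rng.choice(ix, n_min, replace=False)
                           for ix in subgroups])
    return LogisticRegression(max_iter=1000).fit(reps[keep], labels[keep])

def fit_group_thresholds(scores, labels, groups, grid_size=101):
    """Threshold correction (THR): grid-search per-attribute thresholds
    {t_a : a in {-1, +1}} on a holdout set so as to minimise the
    worst-subgroup error of predicting sign(score - t_{a(x)})."""
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    best_ts, best_err = None, np.inf
    for t_neg in grid:
        for t_pos in grid:
            # Group-specific threshold for every holdout point.
            thresh = np.where(groups == +1, t_pos, t_neg)
            preds = np.where(scores > thresh, +1, -1)
            # Error on the worst of the four (label, attribute) subgroups.
            errs = [np.mean(preds[m] != labels[m])
                    for y in (-1, +1) for a in (-1, +1)
                    if (m := (labels == y) & (groups == a)).any()]
            if max(errs) < best_err:
                best_ts, best_err = {-1: t_neg, +1: t_pos}, max(errs)
    return best_ts, best_err
```

Per the splits described above, the holdout arguments would come from the Sagawa et al. (2020a) holdout set for waterbirds, the standard holdout set for CelebA, and a 20% split of the training samples for synth.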
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments; it only mentions training "models".
Software Dependencies | No | The paper mentions the "Logistic Regression package in sklearn" but does not specify the sklearn version, nor any other software dependencies such as the TensorFlow/PyTorch version used for the ResNet-50 experiments.
Experiment Setup | Yes | "For the ResNet-50 experiments... We train the models using SGD with a momentum value of 0.9. We use a batch size of 128, weight decay 10^-4, and a learning rate decayed according to a cosine schedule. We train with a base learning rate of 10^-4 for 1000 epochs on waterbirds, and a base learning rate of 10^-2 for 50 epochs on CelebA."
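
As a concrete rendering of these hyperparameters, the following minimal sketch configures the reported optimiser and schedule in PyTorch. The framework is our assumption (the paper does not name one, per the Software Dependencies row above), and train_loader is a hypothetical DataLoader yielding batches of 128 labelled images.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

# Reported hyperparameters: SGD with momentum 0.9, batch size 128,
# weight decay 1e-4, and a cosine-decayed learning rate.
model = resnet50(num_classes=2)
base_lr, epochs = 1e-4, 1000  # waterbirds; CelebA uses 1e-2 and 50 epochs

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    for x, y in train_loader:  # hypothetical DataLoader, batch_size=128
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
    scheduler.step()  # one cosine step per epoch
```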