How to prepare your task head for finetuning
Authors: Yi Ren, Shangmin Guo, Wonho Bae, Danica J. Sutherland
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analytically prove this trend in an overparameterized linear setting and verify its applicability to different experimental settings; we find a non-trivial trend in feature adaptation and verify it in many cases; and we show how controlling feature adaptation can improve downstream performance. |
| Researcher Affiliation | Academia | Yi Ren University of British Columbia renyi.joshua@gmail.com Shangmin Guo University of Edinburgh s.guo@ed.ac.uk Wonho Bae University of British Columbia whbae@cs.ubc.ca Danica J. Sutherland University of British Columbia & Amii dsuth@cs.ubc.ca |
| Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' block is present in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/Joshua-Ren/how_to_prepare_taskhead. |
| Open Datasets | Yes | MNIST (LeCun, 1998), ImageNet-1K (Deng et al., 2009), CIFAR10 (Krizhevsky et al., 2009), PASCAL VOC (Everingham et al., 2015), STL10 (Coates et al., 2011), Flowers102 (Nilsback & Zisserman, 2008), Stanford Cars (Krause et al., 2013), DomainNet (Peng et al., 2019), ogbg-moltox21 (Wu et al., 2018), ogbg-molhiv (Hu et al., 2020), ogbg-molpcba (Hu et al., 2020). These datasets are all standard, publicly available benchmarks, and are cited appropriately. |
| Dataset Splits | Yes | Table 4 ('Datasets (vision and molecular graph) used in experiments') lists '# train' and '# test' columns for all datasets (e.g., MNIST: 60,000 train, 10,000 test). The paper also mentions 'validation accuracy after finetuning (FT-valid-acc for short)' and 'sweeping the optimal τ using validation accuracy'. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or specific cloud instances) are provided for the experiments. The paper generally mentions 'huge computing resources' but no specifications. |
| Software Dependencies | No | The paper mentions implementing models like ResNet, MLP, and GCN, and discusses training with SGD, but does not provide specific version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The paper provides specific hyperparameters and training details: 'batch size is 128, hidden layer width is 128 (in the MLP head case)', learning rate 10^-3 with a cosine scheduler, 5×10^-4 weight decay, 'simple augmentations like random flipping and cropping', an HP learning rate of 3×10^-2, a maximum of 200 FT epochs, SGD with momentum (β = 0.9), and 'a batch size of 16, and a SGD optimizer with momentum (β = 0.9) but without weight decay or learning rate scheduler'. A hedged sketch of this setup appears below the table. |
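The sketch below illustrates the finetuning hyperparameters quoted in the Experiment Setup row (SGD with momentum 0.9, learning rate 10^-3, weight decay 5×10^-4, cosine schedule, flip/crop augmentations). It is a minimal reading of the reported values, not the authors' released code: the ResNet-18 backbone, the CIFAR-sized crop, and the helper name `build_finetune_setup` are illustrative assumptions.

```python
# Hedged sketch of the reported finetuning setup (PyTorch).
# Assumptions not stated in the table: torchvision ResNet-18 backbone and
# CIFAR-sized inputs; the paper evaluates several architectures and datasets.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms


def build_finetune_setup(num_classes: int = 10, max_epochs: int = 200):
    # Backbone choice is an assumption; only the task head is dataset-specific.
    model = torchvision.models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # task head

    # Reported finetuning hyperparameters: SGD with momentum 0.9,
    # learning rate 1e-3, weight decay 5e-4, cosine schedule over FT epochs.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=max_epochs
    )

    # "Simple augmentations like random flipping and cropping."
    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, padding=4),  # crop size assumes CIFAR-sized images
        transforms.ToTensor(),
    ])
    return model, optimizer, scheduler, train_transform
```

For the head-probing stage described in the same row (HP learning rate 3×10^-2, batch size 16, no weight decay or scheduler), one would presumably build a separate SGD optimizer over only `model.fc.parameters()`; the paper's repository linked above is the authoritative reference for those details.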