Auto-Transfer: Learning to Route Transferable Representations

Authors: Keerthiram Murugesan, Vijay Sadashivaiah, Ronny Luss, Karthikeyan Shanmugam, Pin-Yu Chen, Amit Dhurandhar

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present experimental results to validate our Auto-Transfer methods. We first show the improvements in model accuracy that can be achieved over various baselines on six different datasets (Section A.3) and two network/task setups.
Researcher Affiliation | Collaboration | 1IBM Research, Yorktown Heights; 2Rensselaer Polytechnic Institute, New York
Pseudocode | Yes | Algorithm 1: AMAB Update Algorithm for Target Layer ℓ; Algorithm 2: TRAIN-TARGET (Train Target Network); Algorithm 3: EVALUATE (Evaluate Target Network). (A generic sketch of such a bandit update appears after the table.)
Open Source Code | Yes | Code available at https://github.com/IBM/auto-transfer
Open Datasets | Yes | We apply our method to four target tasks: Caltech-UCSD Bird 200 (Wah et al., 2011), MIT Indoor Scene Recognition (Quattoni & Torralba, 2009), Stanford 40 Actions (Yao et al., 2011), and Stanford Dogs (Khosla et al., 2011). For Tiny ImageNet based transfer, we apply our method on two target tasks: CIFAR100 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011).
Dataset Splits | No | The bandit algorithm intervenes once every epoch of training to make choices using rewards from evaluation of the combined network on a hold-out set, while the latest choice made by the bandit is used by the training algorithm to update the target network parameters on the target task. (...) Reward function: The reward r_t for the selected routing choice is then computed by evaluating the gain in loss due to the chosen source-target combination as follows: the prediction gain is the difference between the target network's losses on a hold-out set D_v with and without the routing choice a_t, i.e., L(f_T^M(x)) - L(f̃_T^M(x)) for a given image x from the hold-out data. (A sketch of this reward computation appears after the table.)
Hardware Specification | Yes | The target models were trained in parallel on two machines with the specifications shown in Table 2. CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz; Memory: 128GB; GPUs: 1 x NVIDIA Tesla V100 16GB; Disk: 600GB; OS: Ubuntu 18.04-64 Minimal for VSI.
Software Dependencies | No | The paper mentions optimizers (SGD, ADAM) and a learning rate scheduler (Cosine Annealing) but does not provide version numbers for software libraries or frameworks such as Python, PyTorch, TensorFlow, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | For our experimental analysis in the main paper, we set the number of epochs for training to E = 200. The learning rate for SGD is set to 0.1 with momentum 0.9 and weight decay 0.001. The learning rate for ADAM is set to 0.001 with a weight decay of 0.001. We use a Cosine Annealing learning rate scheduler for both optimizers. The batch size for training is set to 64. (A configuration sketch appears after the table.)
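
The AMAB update referenced in the Pseudocode row is an adversarial multi-armed bandit over routing choices for one target layer. The paper's Algorithm 1 is not reproduced in this report, so the sketch below shows a generic EXP3-style update instead; the class name `EXP3Router`, the exploration rate `gamma`, and the assumption that rewards have been rescaled to [0, 1] are all illustrative, not taken from the paper.

```python
import numpy as np

class EXP3Router:
    """Generic EXP3-style adversarial bandit over K routing choices.

    Illustrative sketch only: the paper's AMAB update (Algorithm 1)
    may differ in its probability mixing and reward scaling.
    """

    def __init__(self, num_choices: int, gamma: float = 0.1):
        self.K = num_choices
        self.gamma = gamma                    # exploration rate (assumed)
        self.weights = np.ones(num_choices)   # one weight per routing choice

    def probabilities(self) -> np.ndarray:
        # Mix the normalized weights with uniform exploration.
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / self.K

    def select(self) -> int:
        # Sample the routing choice used for the next training epoch.
        return int(np.random.choice(self.K, p=self.probabilities()))

    def update(self, choice: int, reward: float) -> None:
        # Importance-weighted reward estimate keeps the update unbiased;
        # the chosen arm's weight is then exponentially reweighted.
        # Assumes `reward` has been rescaled into [0, 1].
        p = self.probabilities()[choice]
        estimate = reward / p
        self.weights[choice] *= np.exp(self.gamma * estimate / self.K)
```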
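The Dataset Splits row quotes the paper's reward function: the prediction gain of the routed target network over the unrouted one on a hold-out set. A minimal sketch of that computation follows; `routed_model` and `plain_model` are hypothetical stand-ins for the target network with and without the routing choice, and the sign convention (positive reward when routing lowers hold-out loss) is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_gain(routed_model, plain_model, holdout_loader, device="cpu"):
    """Reward for the bandit: average loss difference on the hold-out set
    between the target network without and with the routing choice.

    `routed_model` / `plain_model` are hypothetical stand-ins, not the
    paper's actual modules.
    """
    gain, n = 0.0, 0
    for x, y in holdout_loader:
        x, y = x.to(device), y.to(device)
        loss_without = F.cross_entropy(plain_model(x), y, reduction="sum")
        loss_with = F.cross_entropy(routed_model(x), y, reduction="sum")
        gain += (loss_without - loss_with).item()
        n += y.size(0)
    return gain / n   # positive when the routing choice helps
```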
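The Experiment Setup row maps directly onto optimizer and scheduler configuration. The following PyTorch sketch wires up the reported hyperparameters (E = 200 epochs, batch size 64, SGD with lr 0.1 / momentum 0.9 / weight decay 0.001, ADAM with lr 0.001 / weight decay 0.001, cosine annealing for both); the placeholder model and the loop skeleton are assumptions.

```python
import torch
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS, BATCH_SIZE = 200, 64

model = torch.nn.Linear(512, 100)  # hypothetical placeholder network

# SGD setup as reported: lr 0.1, momentum 0.9, weight decay 0.001.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.001)
# Alternative reported setup: ADAM with lr 0.001 and weight decay 0.001.
# optimizer = Adam(model.parameters(), lr=0.001, weight_decay=0.001)

# Cosine annealing over the full run, used with either optimizer.
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # ... one pass over the training set with batch size 64 ...
    scheduler.step()
```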