Betty: An Automatic Differentiation Library for Multilevel Optimization
Authors: Sang Keun Choe, Willie Neiswanger, Pengtao Xie, Eric Xing
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that BETTY can be used to implement an array of MLO programs, while also observing up to 11% increase in test accuracy, 14% decrease in GPU memory usage, and 20% decrease in training wall time over existing implementations on multiple benchmarks. We also showcase that BETTY enables scaling MLO to models with hundreds of millions of parameters. |
| Researcher Affiliation | Academia | ¹Carnegie Mellon University, ²Stanford University, ³UCSD, ⁴MBZUAI |
| Pseudocode | Yes | `class MyProblem(Problem):`<br>`    def training_step(self, batch):`<br>`        # Users define the cost function here`<br>`        return cost_fn(batch, self.module, self.other_probs, ...)`<br>`config = Config(type="darts", unroll_steps=10, fp16=True, gradient_accumulation=4)`<br>`prob = MyProblem("myproblem", config, module, optimizer, data_loader)` Listing 1: Problem class example. (A runnable sketch of this listing follows the table.) |
| Open Source Code | Yes | We open-source the code at https://github.com/leopard-ai/betty. |
| Open Datasets | Yes | Following the original paper, we artificially inject class imbalance into the CIFAR-10 dataset by geometrically decreasing the number of data samples for each class, as per an imbalance factor. ... We test our framework on the WRENCH benchmark (Zhang et al., 2021a), which contains multiple weak supervision datasets. ... We conduct an experiment on the Office Home dataset (Venkateswara et al., 2017) that consists of 15,500 images from 65 classes and 4 domains: Art (Ar), Clipart (Cl), Product (Pr), and Real World (RW). (A dataset-construction sketch follows the table.) |
| Dataset Splits | Yes | Dataset We reuse the long-tailed CIFAR-10 dataset from the original paper (Shu et al., 2019) as our inner-level training dataset. ... We randomly select 100 samples from the validation set to construct the upper-level (or meta) training dataset, and use the rest of it as the validation dataset, on which classification accuracy is reported in the main text. ... Dataset We split each domain of the Office Home dataset (Venkateswara et al., 2017) into training/validation/test datasets with a ratio of 5:3:2. ... Following the original paper (Liu et al., 2019), we use the first half of the CIFAR-10 training dataset as our inner-level training dataset (i.e. classification network) and the other half as the outer-level training dataset (i.e. architecture network). |
| Hardware Specification | No | The paper mentions 'GPU memory usage' and 'CUDA out-of-memory error' which implies the use of GPUs, but it does not specify any particular GPU models (e.g., NVIDIA A100, V100) or other hardware details like CPU models or memory. |
| Software Dependencies | No | The paper mentions software like PyTorch and optimizers like Adam, but it does not provide specific version numbers for these software components, which is required for reproducible dependency descriptions. |
| Experiment Setup | Yes | In this section, we provide further training details (e.g. hyperparameters) of each experiment. ... Meta-Weight-Network ... It is trained with the Adam optimizer (Kingma & Ba, 2014) whose learning rate is set to 0.00001 throughout the whole training procedure, momentum values to (0.9, 0.999), and weight decay value to 0. MWN is trained for 10,000 iterations and the learning rate is fixed throughout training. Classification Network ... It is trained with the SGD optimizer whose initial learning rate is set to 0.1, momentum value to 0.9, and weight decay value to 0.0005. Training is performed for 10,000 iterations, and we decay the learning rate by a factor of 10 on the iterations of 5,000 and 7,500. (An optimizer-setup sketch follows the table.) |
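
Listing 1 in the Pseudocode row is nearly complete on its own. The sketch below fills it in so it runs end to end: the `Problem`/`Config` names, the constructor arguments, and the `training_step` hook are taken from the listing as quoted, while the import paths, the toy model/optimizer/data loader, and the `cost_fn` body are illustrative assumptions rather than Betty's actual code.

```python
# Runnable fleshing-out of Listing 1. Problem/Config usage follows the listing;
# import paths, the toy model/data, and cost_fn are assumptions for illustration.
import torch
from torch.utils.data import DataLoader, TensorDataset
from betty.problems import Problem  # import path assumed; check Betty's docs
from betty.configs import Config    # import path assumed; check Betty's docs

def cost_fn(batch, module, other_probs):
    # Placeholder loss; the paper leaves the cost function user-defined.
    inputs, targets = batch
    return torch.nn.functional.cross_entropy(module(inputs), targets)

class MyProblem(Problem):
    def training_step(self, batch):
        # Users define the cost function here (per Listing 1).
        return cost_fn(batch, self.module, self.other_probs)

module = torch.nn.Linear(32, 10)  # toy stand-in for a real network
optimizer = torch.optim.SGD(module.parameters(), lr=0.1)
data_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
    batch_size=32,
)

config = Config(type="darts", unroll_steps=10, fp16=True, gradient_accumulation=4)
prob = MyProblem("myproblem", config, module, optimizer, data_loader)
```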
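
The Open Datasets and Dataset Splits rows describe two constructions that are easy to get subtly wrong: geometric class-size decay controlled by an imbalance factor, and the 100-sample meta split carved out of the validation set. Below is one plausible implementation; the decay formula, function names, seeds, and the use of CIFAR-10's test split as the validation set are assumptions, not the paper's code.

```python
# Sketch of long-tailed CIFAR-10: class c keeps roughly
# max_per_class * imbalance_factor^(-c / (C - 1)) samples, so sizes decay
# geometrically from class 0 to class C-1. Details are illustrative assumptions.
import numpy as np
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10

def longtail_indices(targets, num_classes=10, imbalance_factor=100,
                     max_per_class=5000, seed=0):
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    keep = []
    for c in range(num_classes):
        n_c = int(max_per_class * imbalance_factor ** (-c / (num_classes - 1)))
        idx_c = np.flatnonzero(targets == c)
        keep.extend(rng.choice(idx_c, size=n_c, replace=False))
    return np.array(keep)

train = CIFAR10(root="./data", train=True, download=True)
longtail_train = Subset(train, longtail_indices(train.targets))

# Meta split: 100 random validation samples for upper-level (meta) training,
# the remainder for reporting validation accuracy.
val = CIFAR10(root="./data", train=False, download=True)
perm = np.random.default_rng(0).permutation(len(val))
meta_set, val_set = Subset(val, perm[:100]), Subset(val, perm[100:])
```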
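
The hyperparameters in the Experiment Setup row translate directly into standard PyTorch optimizer calls, sketched below. The model definitions are placeholders, and reading the paper's "momentum values (0.9, 0.999)" as Adam's betas is an interpretation on our part.

```python
# Optimizer/scheduler setup from the Experiment Setup row: Adam (lr=1e-5,
# betas=(0.9, 0.999), weight decay 0) for the Meta-Weight-Network, SGD
# (lr=0.1, momentum=0.9, weight decay 5e-4) with x0.1 decay at iterations
# 5,000 and 7,500 for the classification network. Models are placeholders.
import torch

mwn = torch.nn.Linear(1, 1)              # placeholder Meta-Weight-Network
classifier = torch.nn.Linear(3072, 10)   # placeholder classification network

mwn_opt = torch.optim.Adam(
    mwn.parameters(), lr=1e-5, betas=(0.9, 0.999), weight_decay=0
)

cls_opt = torch.optim.SGD(
    classifier.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)
# Stepped once per training iteration (not per epoch), so the milestones
# are iteration counts over the 10,000-iteration run.
cls_sched = torch.optim.lr_scheduler.MultiStepLR(
    cls_opt, milestones=[5000, 7500], gamma=0.1
)
```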