GIO: Gradient Information Optimization for Training Dataset Selection
Authors: Dante Everaert, Christopher Potts
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with machine translation, spelling correction, and image recognition, we show that GIO delivers outstanding results with very small train sets. |
| Researcher Affiliation | Collaboration | Dante Everaert, Amazon Search Science, danteev@amazon.com, dante.everaert@gmail.com; Christopher Potts, Stanford University, cgpotts@stanford.edu |
| Pseudocode | Yes | Algorithm 1 Gradient Information Optimization |
| Open Source Code | Yes | We open source a pip-installable implementation of the algorithm as "pip install grad-info-opt". Also available at: https://github.com/daeveraert/gradient-information-optimization |
| Open Datasets | Yes | We first explore machine translation using the WMT14 dataset and Transformer-based models. In this case, G is the WMT14 dataset and X is the dev set. |
| Dataset Splits | Yes | For GIO and random, we cut 1,700 as validation and leave the remaining 15,000 (25%) as our training data. For the full model, we follow this percentage and cut 3,700 as validation and keep 56,300 as the training data. |
| Hardware Specification | Yes | We used AWS p3dn.24xlarge machines to generate the embeddings, which have 8 NVIDIA Tesla V100 GPUs; as a benchmark, generating 15M embeddings takes roughly 4 hours (therefore approx. 8 hours for EN-FR and approx. 1 hour for EN-DE). This process is highly parallelizable for speed, and additionally, more lightweight models like MiniLM (which takes roughly half the time) and even non-neural methods like Sentence2Vec can be used under speed and resource constraints. Across all initializations, we use the following parameters: K: 1500; k in D̂_KL: 5; Max iterations: 1000; Stopping Criterion: increase; v_init: prev_opt; Resets Allowed: false; Iterations in Gradient Descent: 50; Gradient Descent Learning Rate: 0.01. For 0% initialization, we use a uniform start of 20 points, spread from -1 to 1 with 768 dimensions. For 25% and 50% initialization, we start with a random subsample of 375 and 750 clusters respectively, out of the 1500. The following are the quantization and main method signatures for 0% initialization: ... # Initialize class gio_kl = GIOKL.GIOKL() # Read data train, target = gio_kl.read_data_from_csv(PATH_TRAIN, PATH_TARGET) # Quantize data model_train, model_X, transformed_train, transformed_X = gio_kl.quantize(train, target, k=1500) X = jnp.array(model_X.clusterCenters()) train = jnp.array(model_train.clusterCenters()) data = [(i, each.tolist()) for i, each in enumerate(model_train.clusterCenters())] centroids_df = gio_kl.spark.createDataFrame(data=data, schema=[ |
| Software Dependencies | No | The paper mentions using Fairseq, PySpark, JAX, and Matplotlib but does not provide specific version numbers for these software dependencies. It only lists versions for specific pre-trained models like MPNet-Base-V2 and MiniLM-L12-v1. |
| Experiment Setup | Yes | We use Transformer Big from Vaswani et al. (2017), trained for 300k iterations with the same hyperparameters. We use the same processed WMT14 training data. |
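
The code fragment quoted in the Hardware Specification row is cut off mid-call. Below is a hedged reconstruction of that quantization sequence as a runnable script. Only the calls that appear verbatim in the quote (GIOKL.GIOKL, read_data_from_csv, quantize, clusterCenters, spark.createDataFrame) are taken from the paper; the import path, the placeholder CSV paths, and the schema passed to createDataFrame are assumptions, since the original listing truncates before the schema.

```python
# Hedged reconstruction of the quoted 0%-initialization quantization snippet.
# Assumptions (not from the paper): the import path, the placeholder CSV paths,
# and the schema list, which is truncated in the original listing.
import jax.numpy as jnp
from gradient_information_optimization import GIOKL  # assumed import for `pip install grad-info-opt`

PATH_TRAIN = "train_embeddings.csv"    # placeholder path
PATH_TARGET = "target_embeddings.csv"  # placeholder path

# Initialize class
gio_kl = GIOKL.GIOKL()

# Read the candidate (train) and target data
train, target = gio_kl.read_data_from_csv(PATH_TRAIN, PATH_TARGET)

# Quantize both sets with K-means, K = 1500 as in the paper
model_train, model_X, transformed_train, transformed_X = gio_kl.quantize(train, target, k=1500)

# Work with the cluster centroids as JAX arrays
X = jnp.array(model_X.clusterCenters())
train_centroids = jnp.array(model_train.clusterCenters())

# Re-wrap the train centroids in a Spark DataFrame for the selection step
data = [(i, each.tolist()) for i, each in enumerate(model_train.clusterCenters())]
centroids_df = gio_kl.spark.createDataFrame(data=data, schema=["id", "centroid"])  # schema is a guess; truncated in the quote
```

The GIO selection call that follows this step is not part of the quoted excerpt, so it is omitted here; the repository README is the authoritative reference for that API.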
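The Pseudocode row only names Algorithm 1. As a rough illustration of what that algorithm does with the hyperparameters quoted above (k = 5 in the k-NN KL estimator, 50 gradient-descent iterations at learning rate 0.01, up to 1000 outer iterations, stop when the divergence increases, v_init = previous optimum), here is a minimal self-contained JAX sketch of the greedy selection loop. It is not the grad-info-opt implementation: the function names are ours, and the specific k-NN KL-divergence estimator is a standard choice assumed here, not copied from the paper.

```python
# Minimal self-contained sketch of the greedy KL-minimization loop behind
# Algorithm 1, using the hyperparameters quoted in the table above.
# Illustrative only: function names are ours, and the k-NN KL-divergence
# estimator below is a standard choice assumed here, not copied from the paper.
import jax
import jax.numpy as jnp

K_NN = 5        # "k in D̂_KL: 5"
GD_STEPS = 50   # "Iterations in Gradient Descent: 50"
GD_LR = 0.01    # "Gradient Descent Learning Rate: 0.01"


def knn_kl_divergence(x, d_sel, k=K_NN):
    """k-NN estimate of KL(X || D); smaller when D covers the target X well.

    Requires d_sel to contain at least k points.
    """
    n, dim = x.shape
    m = d_sel.shape[0]
    # Distance from each target point to its k-th nearest neighbour within X (excluding itself)
    dxx = jnp.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dxx = dxx + jnp.eye(n) * 1e12                 # mask self-distances
    rho = jnp.sort(dxx, axis=1)[:, k - 1]
    # Distance from each target point to its k-th nearest neighbour in the selected set D
    dxd = jnp.linalg.norm(x[:, None, :] - d_sel[None, :, :], axis=-1)
    nu = jnp.sort(dxd, axis=1)[:, k - 1]
    return dim * jnp.mean(jnp.log((nu + 1e-12) / (rho + 1e-12))) + jnp.log(m / (n - 1.0))


def best_virtual_point(x, d_sel, v_init):
    """Gradient-descend a 'virtual' point v that minimizes KL(X || D ∪ {v})."""
    loss = lambda v: knn_kl_divergence(x, jnp.vstack([d_sel, v[None, :]]))
    grad_fn = jax.jit(jax.grad(loss))
    v = v_init
    for _ in range(GD_STEPS):
        v = v - GD_LR * grad_fn(v)
    return v


def gio_select(x, candidates, d_init, max_iters=1000):
    """Greedily add the candidate nearest to the optimized virtual point,
    stopping once the KL estimate increases ("Stopping Criterion: increase")."""
    d_sel = d_init
    chosen = []
    kl_prev = knn_kl_divergence(x, d_sel)
    v = d_sel[-1]                                 # "v_init: prev_opt"
    for _ in range(max_iters):
        v = best_virtual_point(x, d_sel, v)
        idx = int(jnp.argmin(jnp.linalg.norm(candidates - v, axis=1)))
        d_new = jnp.vstack([d_sel, candidates[idx:idx + 1]])
        kl_new = knn_kl_divergence(x, d_new)
        if kl_new > kl_prev:                      # stop on increase
            break
        d_sel, kl_prev = d_new, kl_new
        chosen.append(idx)                        # a full implementation would also drop idx from the pool
    return d_sel, chosen
```

Per the quantization step quoted above, this loop would run over the K-means centroids (K = 1500) rather than over raw examples, which keeps the pairwise-distance computations tractable.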