GIO: Gradient Information Optimization for Training Dataset Selection

Authors: Dante Everaert, Christopher Potts

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments with machine translation, spelling correction, and image recognition, we show that GIO delivers outstanding results with very small train sets.
Researcher Affiliation | Collaboration | Dante Everaert, Amazon Search Science (danteev@amazon.com, dante.everaert@gmail.com); Christopher Potts, Stanford University (cgpotts@stanford.edu)
Pseudocode | Yes | Algorithm 1: Gradient Information Optimization
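To make the algorithm behind the pseudocode concrete, the following is a minimal sketch of the selection loop that Algorithm 1 describes: greedily grow a selected set D so that a k-nearest-neighbor estimate of KL(X || D) decreases, stopping once it would increase. The names `knn_kl_divergence` and `gio_select` are illustrative, not the released API, and the released implementation is cheaper: it gradient-descends to an ideal next point in embedding space and retrieves its nearest candidate, rather than exhaustively scoring candidates as done here.

```python
import jax.numpy as jnp

def knn_kl_divergence(X, D, k=5):
    """kNN estimate of KL(X || D) in the style of Wang et al. (2009):
    compares each target point's distance to its k-th neighbor in X
    with its distance to its k-th neighbor in D."""
    n, m, d = X.shape[0], D.shape[0], X.shape[1]
    dist_xx = jnp.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist_xd = jnp.linalg.norm(X[:, None, :] - D[None, :, :], axis=-1)
    rho = jnp.sort(dist_xx, axis=1)[:, k]        # index k skips the self-distance 0
    kk = min(k, m)                               # guard while D is still small
    nu = jnp.sort(dist_xd, axis=1)[:, kk - 1]
    return (d / n) * jnp.sum(jnp.log(nu / rho)) + jnp.log(m / (n - 1))

def gio_select(X, G, k=5, max_iterations=1000):
    """Greedy GIO-style selection with the paper's 'increase' stopping criterion."""
    D, remaining = G[:1], G[1:]                  # arbitrary single-point start
    prev_kl = knn_kl_divergence(X, D, k)
    for _ in range(max_iterations):
        if remaining.shape[0] == 0:
            break
        # Score every remaining candidate by the KL it would yield if added.
        kls = jnp.array([knn_kl_divergence(X, jnp.vstack([D, g[None, :]]), k)
                         for g in remaining])
        best = int(jnp.argmin(kls))
        if kls[best] > prev_kl:                  # stop once KL would increase
            break
        prev_kl = kls[best]
        D = jnp.vstack([D, remaining[best][None, :]])
        remaining = jnp.delete(remaining, best, axis=0)
    return D
```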
Open Source Code | Yes | We open source a pip-installable implementation of the algorithm as "pip install grad-info-opt". Also available at: https://github.com/daeveraert/gradient-information-optimization
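For orientation, here is a hypothetical install-and-import snippet. The package name comes from the quote above; the import path and file paths are assumptions, and the repository README is the authoritative reference:

```python
# pip install grad-info-opt            <- package name from the paper
# The import path below is an assumption; consult the repository README.
from gradient_information_optimization import GIOKL

gio_kl = GIOKL.GIOKL()
# Method names follow the appendix excerpt quoted under Hardware Specification.
train, target = gio_kl.read_data_from_csv("train.csv", "target.csv")  # hypothetical paths
```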
Open Datasets | Yes | We first explore machine translation using the WMT14 dataset and Transformer-based models. In this case, G is the WMT14 dataset and X is the dev set.
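The paper does not specify how WMT14 was obtained; one common route, assuming the Hugging Face `datasets` hub copy, is sketched below, with the training split playing the role of G and the dev set the role of X:

```python
from datasets import load_dataset

# Assumption: using the Hugging Face hub copy of WMT14 (the paper does not
# say how the data was loaded). "fr-en" covers EN-FR; "de-en" covers EN-DE.
wmt = load_dataset("wmt14", "fr-en")
G = wmt["train"]        # candidate pool G: the full WMT14 training data
X = wmt["validation"]   # target set X: the dev set
print(len(G), len(X))
```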
Dataset Splits | Yes | For GIO and random, we cut 1,700 as validation and leave the remaining 15,000 (25%) as our training data. For the full model, we follow this percentage and cut 3,700 as validation and keep 56,300 as the training data.
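A quick arithmetic check shows how the quoted sizes fit together: the 15,000 GIO/random training examples are exactly 25% of the full 60,000-example pool (56,300 train + 3,700 validation):

```python
gio_val, gio_train = 1_700, 15_000
full_val, full_train = 3_700, 56_300
total = full_val + full_train      # 60,000 examples overall
print(gio_train / total)           # 0.25, the "(25%)" in the quote
```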
Hardware Specification | Yes | We used AWS p3dn.24xlarge machines to generate the embeddings, which have 8 NVIDIA Tesla V100 GPUs; as a benchmark, generating 15M embeddings takes roughly 4 hours (therefore approx. 8 hours for EN-FR and approx. 1 hour for EN-DE). This process is highly parallelizable for speed, and more lightweight models like MiniLM (which takes roughly half the time) or even non-neural methods like Sentence2Vec can be used under speed and resource constraints. Across all initializations, we use the following parameters:

- K: 1500
- k in D̂_KL: 5
- Max iterations: 1000
- Stopping criterion: increase
- v_init: prev_opt
- Resets allowed: false
- Iterations in gradient descent: 50
- Gradient descent learning rate: 0.01

For 0% initialization, we use a uniform start of 20 points, spread from -1 to 1 with 768 dimensions. For 25% and 50% initialization, we start with a random subsample of 375 and 750 clusters respectively, out of the 1500. The following are the quantization call and main method signatures for 0% initialization:

```python
...
# Initialize
gio_kl = GIOKL.GIOKL()
# Read data
train, target = gio_kl.read_data_from_csv(PATH_TRAIN, PATH_TARGET)
# Quantize data
model_train, model_X, transformed_train, transformed_X = gio_kl.quantize(train, target, k=1500)
X = jnp.array(model_X.clusterCenters())
train = jnp.array(model_train.clusterCenters())
data = [(i, each.tolist()) for i, each in enumerate(model_train.clusterCenters())]
centroids_df = gio_kl.spark.createDataFrame(data=data, schema=[
```
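The embedding step itself is not shown in the excerpt. The sketch below assumes the sentence-transformers checkpoints corresponding to the models named above (the exact hub identifiers are an assumption); "all-mpnet-base-v2" outputs the 768-dimensional vectors that the 0% initialization presumes:

```python
from sentence_transformers import SentenceTransformer

# Assumption: "all-mpnet-base-v2" is the hub checkpoint for the MPNet-Base-V2
# model the paper names; MiniLM is the lighter alternative mentioned for speed.
sentences = ["an example source sentence", "another example sentence"]
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(sentences, batch_size=256, convert_to_numpy=True)
print(embeddings.shape)  # (2, 768)
```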
Software Dependencies | No | The paper mentions using Fairseq, PySpark, JAX, and Matplotlib but does not provide specific version numbers for these software dependencies. It only lists versions for specific pre-trained models like MPNet-Base-V2 and MiniLM-L12-v1.
Experiment Setup | Yes | We use Transformer Big from Vaswani et al. (2017), trained for 300k iterations with the same hyperparameters. We use the same processed WMT14 training data.
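For reference, the Transformer Big hyperparameters that the quote points to, as reported in Vaswani et al. (2017, Table 3), are collected below; this is a summary of the cited paper, not a verbatim config from this one:

```python
# Transformer Big, as reported in Vaswani et al. (2017), Table 3.
TRANSFORMER_BIG = {
    "layers": 6,              # encoder and decoder each
    "d_model": 1024,
    "d_ff": 4096,
    "attention_heads": 16,
    "dropout": 0.3,           # 0.1 was used for the EN-FR big model
    "label_smoothing": 0.1,
    "adam_betas": (0.9, 0.98),
    "adam_eps": 1e-9,
    "warmup_steps": 4000,
    "train_steps": 300_000,   # the "300k iterations" in the quote
}
```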