GIO: Gradient Information Optimization for Training Dataset Selection

Authors: Dante Everaert, Christopher Potts

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments with machine translation, spelling correction, and image recognition, we show that GIO delivers outstanding results with very small train sets.
Researcher Affiliation | Collaboration | Dante Everaert, Amazon Search Science (danteev@amazon.com, dante.everaert@gmail.com); Christopher Potts, Stanford University (cgpotts@stanford.edu)
Pseudocode | Yes | Algorithm 1: Gradient Information Optimization
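To make the algorithm behind the pseudocode concrete, the following is a minimal sketch of the selection loop that Algorithm 1 describes: greedily grow a selected set D so that a k-nearest-neighbor estimate of KL(X || D) decreases, stopping once it would increase. The names `knn_kl_divergence` and `gio_select` are illustrative, not the released API, and the released implementation is cheaper: it gradient-descends to an ideal next point in embedding space and retrieves its nearest candidate, rather than exhaustively scoring candidates as done here.

```python
import jax.numpy as jnp

def knn_kl_divergence(X, D, k=5):
    """kNN estimate of KL(X || D) in the style of Wang et al. (2009):
    compares each target point's distance to its k-th neighbor in X
    with its distance to its k-th neighbor in D."""
    n, m, d = X.shape[0], D.shape[0], X.shape[1]
    dist_xx = jnp.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist_xd = jnp.linalg.norm(X[:, None, :] - D[None, :, :], axis=-1)
    rho = jnp.sort(dist_xx, axis=1)[:, k]        # index k skips the self-distance 0
    kk = min(k, m)                               # guard while D is still small
    nu = jnp.sort(dist_xd, axis=1)[:, kk - 1]
    return (d / n) * jnp.sum(jnp.log(nu / rho)) + jnp.log(m / (n - 1))

def gio_select(X, G, k=5, max_iterations=1000):
    """Greedy GIO-style selection with the paper's 'increase' stopping criterion."""
    D, remaining = G[:1], G[1:]                  # arbitrary single-point start
    prev_kl = knn_kl_divergence(X, D, k)
    for _ in range(max_iterations):
        if remaining.shape[0] == 0:
            break
        # Score every remaining candidate by the KL it would yield if added.
        kls = jnp.array([knn_kl_divergence(X, jnp.vstack([D, g[None, :]]), k)
                         for g in remaining])
        best = int(jnp.argmin(kls))
        if kls[best] > prev_kl:                  # stop once KL would increase
            break
        prev_kl = kls[best]
        D = jnp.vstack([D, remaining[best][None, :]])
        remaining = jnp.delete(remaining, best, axis=0)
    return D
```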
Open Source Code | Yes | We open source a pip-installable implementation of the algorithm as "pip install grad-info-opt". Also available at: https://github.com/daeveraert/gradient-information-optimization
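For orientation, here is a hypothetical install-and-import snippet. The package name comes from the quote above; the import path and file paths are assumptions, and the repository README is the authoritative reference:

```python
# pip install grad-info-opt            <- package name from the paper
# The import path below is an assumption; consult the repository README.
from gradient_information_optimization import GIOKL

gio_kl = GIOKL.GIOKL()
# Method names follow the appendix excerpt quoted under Hardware Specification.
train, target = gio_kl.read_data_from_csv("train.csv", "target.csv")  # hypothetical paths
```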
Open Datasets | Yes | We first explore machine translation using the WMT14 dataset and Transformer-based models. In this case, G is the WMT14 dataset and X is the dev set.
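The paper does not specify how WMT14 was obtained; one common route, assuming the Hugging Face `datasets` hub copy, is sketched below, with the training split playing the role of G and the dev set the role of X:

```python
from datasets import load_dataset

# Assumption: using the Hugging Face hub copy of WMT14 (the paper does not
# say how the data was loaded). "fr-en" covers EN-FR; "de-en" covers EN-DE.
wmt = load_dataset("wmt14", "fr-en")
G = wmt["train"]        # candidate pool G: the full WMT14 training data
X = wmt["validation"]   # target set X: the dev set
print(len(G), len(X))
```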
Dataset Splits | Yes | For GIO and random, we cut 1,700 as validation and leave the remaining 15,000 (25%) as our training data. For the full model, we follow this percentage and cut 3,700 as validation and keep 56,300 as the training data.
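A quick arithmetic check shows how the quoted sizes fit together: the 15,000 GIO/random training examples are exactly 25% of the full 60,000-example pool (56,300 train + 3,700 validation):

```python
gio_val, gio_train = 1_700, 15_000
full_val, full_train = 3_700, 56_300
total = full_val + full_train      # 60,000 examples overall
print(gio_train / total)           # 0.25, the "(25%)" in the quote
```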
Hardware Specification | Yes | We used AWS p3dn.24xlarge machines to generate the embeddings, which have 8 NVIDIA Tesla V100 GPUs; as a benchmark, generating 15M embeddings takes roughly 4 hours (therefore approx. 8 hours for EN-FR and approx. 1 hour for EN-DE). This process is highly parallelizable for speed, and more lightweight models like MiniLM (which takes roughly half the time) or even non-neural methods like Sentence2Vec can be used under speed and resource constraints. Across all initializations, we use the following parameters:

- K: 1500
- k in D̂_KL: 5
- Max iterations: 1000
- Stopping criterion: increase
- v_init: prev_opt
- Resets allowed: false
- Iterations in gradient descent: 50
- Gradient descent learning rate: 0.01

For 0% initialization, we use a uniform start of 20 points, spread from -1 to 1 with 768 dimensions. For 25% and 50% initialization, we start with a random subsample of 375 and 750 clusters respectively, out of the 1500. The following are the quantization call and main method signatures for 0% initialization:

```python
...
# Initialize
gio_kl = GIOKL.GIOKL()
# Read data
train, target = gio_kl.read_data_from_csv(PATH_TRAIN, PATH_TARGET)
# Quantize data
model_train, model_X, transformed_train, transformed_X = gio_kl.quantize(train, target, k=1500)
X = jnp.array(model_X.clusterCenters())
train = jnp.array(model_train.clusterCenters())
data = [(i, each.tolist()) for i, each in enumerate(model_train.clusterCenters())]
centroids_df = gio_kl.spark.createDataFrame(data=data, schema=[
```
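The embedding step itself is not shown in the excerpt. The sketch below assumes the sentence-transformers checkpoints corresponding to the models named above (the exact hub identifiers are an assumption); "all-mpnet-base-v2" outputs the 768-dimensional vectors that the 0% initialization presumes:

```python
from sentence_transformers import SentenceTransformer

# Assumption: "all-mpnet-base-v2" is the hub checkpoint for the MPNet-Base-V2
# model the paper names; MiniLM is the lighter alternative mentioned for speed.
sentences = ["an example source sentence", "another example sentence"]
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(sentences, batch_size=256, convert_to_numpy=True)
print(embeddings.shape)  # (2, 768)
```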
Software Dependencies | No | The paper mentions using Fairseq, PySpark, JAX, and Matplotlib but does not provide specific version numbers for these software dependencies. It only lists versions for specific pre-trained models like MPNet-Base-V2 and MiniLM-L12-v1.
Experiment Setup | Yes | We use Transformer Big from Vaswani et al. (2017), trained for 300k iterations with the same hyperparameters. We use the same processed WMT14 training data.
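For reference, the Transformer Big hyperparameters that the quote points to, as reported in Vaswani et al. (2017, Table 3), are collected below; this is a summary of the cited paper, not a verbatim config from this one:

```python
# Transformer Big, as reported in Vaswani et al. (2017), Table 3.
TRANSFORMER_BIG = {
    "layers": 6,              # encoder and decoder each
    "d_model": 1024,
    "d_ff": 4096,
    "attention_heads": 16,
    "dropout": 0.3,           # 0.1 was used for the EN-FR big model
    "label_smoothing": 0.1,
    "adam_betas": (0.9, 0.98),
    "adam_eps": 1e-9,
    "warmup_steps": 4000,
    "train_steps": 300_000,   # the "300k iterations" in the quote
}
```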