A Gradient Accumulation Method for Dense Retriever under Memory Constraint
Authors: Jaehee Kim, Yukyung Lee, Pilsung Kang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on five widely used information retrieval datasets indicate that CONTACCUM can surpass not only existing memory reduction methods but also the high-resource scenario. |
| Researcher Affiliation | Academia | 1Seoul National University 2Boston University {jaehee_kim, pilsung_kang}@snu.ac.kr ylee5@bu.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | All code and links to download the datasets are included in the supplemental material. Additionally, we plan to release the code for reproducibility of the main experimental results after the review process to preserve anonymity. |
| Open Datasets | Yes | The datasets used for the experiments were Natural Questions (NQ) [18], Trivia QA [15], Curated TREC (TREC) [1], and Web Questions (Web Q) [2] processed by DPR and MS Marco [26]. |
| Dataset Splits | Yes | The optimal memory bank size, N_memory, was selected using evaluation data with candidates [128, 512, 2048], resulting in 2,048 for NQ and 512 for Trivia QA. For MS Marco, Web Q, and TREC, due to the lack of evaluation data, N_memory was set based on dataset size: 1,024 for MS Marco, and 128 for Web Q and TREC. |
| Hardware Specification | Yes | All experiments were conducted on a single A100 80GB GPU. For the high-resource scenario, we considered situations where 80GB of memory is available. For low-resource settings, we assumed the memory available on widely used commercial GPUs: 11GB (GTX-1080Ti), 24GB (RTX-3080Ti, RTX-4090Ti). |
| Software Dependencies | No | The experimental code was adapted from nano-DPR, which provides a simplified training and evaluation pipeline for DPR. All experiments were conducted using the BERT [6] model. For retrieval, we used the FAISS [14] library to perform exact nearest neighbor search with default hyperparameters. Using the torch.cuda.set_per_process_memory_fraction function in PyTorch [27] allows for restricting the memory used during training, regardless of the total available memory. |
| Experiment Setup | Yes | The hyperparameters for training were set as follows: the warmup step was 1,237 steps, weight decay was set to 0, and a customized scheduler with a linear decay of the learning rate after the warmup was used. The optimizer was AdamW [23] with epsilon set to 1e-8, and the learning rate was 2e-5. Gradient clipping was applied at a value of 2.0, and τ was set to 1. |
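The memory-restriction setup described under Hardware Specification and Software Dependencies (emulating an 11 GB or 24 GB card on an 80 GB A100 via `torch.cuda.set_per_process_memory_fraction`) could be reproduced along these lines. This is a minimal sketch, not code from the paper; `cap_gpu_memory` is a hypothetical helper name.

```python
import torch


def cap_gpu_memory(limit_gb: float, device: int = 0) -> float:
    """Restrict this process to roughly `limit_gb` of GPU memory.

    Hypothetical helper mirroring the paper's use of
    torch.cuda.set_per_process_memory_fraction to emulate smaller
    commercial GPUs (e.g. 11 GB GTX-1080Ti) on an 80 GB A100.
    """
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    # Fraction of the physical card that corresponds to the target limit.
    fraction = min(1.0, limit_gb * 1024**3 / total_bytes)
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return fraction


if torch.cuda.is_available():
    # e.g. emulate an 11 GB card; allocations beyond the cap raise OOM.
    frac = cap_gpu_memory(11.0)
    print(f"capped to {frac:.4f} of total GPU memory")
```

On an 80 GB A100 this would set the fraction to 11/80 = 0.1375, so out-of-memory errors occur at the same point they would on the smaller card.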
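The learning-rate schedule in the Experiment Setup row (1,237 warmup steps, peak learning rate 2e-5, linear decay after warmup) can be sketched as a plain function. `total_steps` is an assumed placeholder; the excerpt does not state the total number of training steps.

```python
def lr_at_step(step: int,
               base_lr: float = 2e-5,
               warmup_steps: int = 1237,
               total_steps: int = 20000) -> float:
    """Linear warmup to base_lr, then linear decay to 0.

    Sketch of the customized scheduler described in the setup;
    total_steps is an assumption, not a value from the paper.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

In practice this shape matches what `transformers.get_linear_schedule_with_warmup` produces when attached to the AdamW optimizer with the stated epsilon (1e-8) and peak learning rate.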