Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning
Authors: Yeongbin Seo, Dongha Lee, Jinyoung Yeo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments conducted on both newly introduced and established CKL benchmarks, TAALM proves the state-of-the-art performance upon the baselines... 4 Experiment: We conduct experiments on two benchmarks. One is our newly designed LAMA-CKL, and the other is the established benchmark, TemporalWiki [Jang et al., 2022]. |
| Researcher Affiliation | Academia | Yeongbin Seo Dongha Lee Jinyoung Yeo Department of Artificial Intelligence Yonsei University {suhcrates,donalee,jinyeo}@yonsei.ac.kr |
| Pseudocode | Yes | Algorithm 1: Optimization of Train-Attention |
| Open Source Code | Yes | The code and the dataset will be available online [https://github.com/ybseo-ac/TAALM] |
| Open Datasets | Yes | We also introduce a new CKL benchmark, LAMA-CKL... We experiment on LAMA-CKL and previous CKL benchmark (TemporalWiki [Jang et al., 2022])... The code and the dataset will be available online [https://github.com/ybseo-ac/TAALM] |
| Dataset Splits | Yes | Of the 4166 train data, 100 are used for validation. |
| Hardware Specification | Yes | 8 RTX 3090 GPUs (24GB) are used, with a global batch size of 64. A single A100 (82GB) GPU is used, and the effect of batch size 16 is achieved through gradient accumulation. |
| Software Dependencies | No | The paper mentions models and frameworks (e.g., Llama2-7B, TinyLlama-1.1B, QLoRA, the AdamW optimizer), and Python is implied by the code base, but it does not specify version numbers for general software dependencies such as PyTorch, CUDA, or specific Python libraries. |
| Experiment Setup | Yes | Learning rate 1e-4, AdamW optimizer, and max length of 512 tokens are applied. A total of 30 epochs took 25 minutes of GPU time. We utilize Llama2-7B integrated with QLoRA [Dettmers et al., 2024] as a base model... We employ LoRA r = 64, α = 16, NF4 with BF16 computation datatype. (A configuration sketch follows the table.) |
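
The experiment-setup and hardware rows describe a QLoRA fine-tuning recipe: NF4 quantization with BF16 compute, LoRA rank 64 with α = 16, AdamW at learning rate 1e-4, a 512-token max length, and gradient accumulation to reach an effective batch of 16. The sketch below shows how such a configuration might be assembled with the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries; the checkpoint id, `GRAD_ACCUM_STEPS`, and the `make_batch`/`train_step` helpers are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch of the reported QLoRA configuration; hyperparameters come from
# the table above, everything else (model id, helpers) is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint id
GRAD_ACCUM_STEPS = 16                   # per-device batch 1 -> effective batch 16

# NF4 4-bit quantization with BF16 computation datatype, as reported
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb_config)

# LoRA adapters with the reported rank and scaling
model = get_peft_model(model, LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM"))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def make_batch(texts):
    """Tokenize a list of texts with the reported max length of 512."""
    return tokenizer(texts, max_length=512, truncation=True, padding=True, return_tensors="pt")

def train_step(micro_batches):
    """One optimizer update accumulated over GRAD_ACCUM_STEPS micro-batches."""
    optimizer.zero_grad()
    for texts in micro_batches:            # expects GRAD_ACCUM_STEPS small batches
        batch = make_batch(texts)
        loss = model(**batch, labels=batch["input_ids"]).loss
        (loss / GRAD_ACCUM_STEPS).backward()
    optimizer.step()
```

Dividing each micro-batch loss by `GRAD_ACCUM_STEPS` before `backward()` keeps the accumulated gradient equivalent to a single batch of 16, which is how the paper reports matching the larger batch on a single GPU.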