A Theoretical Analysis of the Repetition Problem in Text Generation

Authors: Zihao Fu, Wai Lam, Anthony Man-Cho So, Bei Shi
Pages: 12848-12856

Venue: AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results show that our theoretical framework is applicable in general generation models and our proposed rebalanced encoding approach alleviates the repetition problem significantly in both the translation task and the language modeling task."
Researcher Affiliation | Collaboration | Zihao Fu (1), Wai Lam (1), Anthony Man-Cho So (1), Bei Shi (2). (1) Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong; (2) AI Lab, Tencent.
Pseudocode | Yes | Algorithm 1 (Rebalanced Encoding) is given as Python pseudocode. Reconstructed below with the extraction garbling repaired (a toy usage sketch follows this table):

import numpy

def learn_re(words: list, N: int, gamma: float):
    """Learn rebalanced-encoding merge rules from a token sequence."""
    merges = []
    for step in range(N):
        # Build a first-order transition-count matrix over the current vocabulary.
        id_to_word = list(set(words))
        word_to_id = {w: i for i, w in enumerate(id_to_word)}
        M = numpy.zeros([len(id_to_word), len(id_to_word)])
        for i in range(len(words) - 1):
            M[word_to_id[words[i]], word_to_id[words[i + 1]]] += 1
        # Row-normalize counts into transition probabilities (clip avoids division by zero).
        M = M / M.sum(1).reshape(-1, 1).clip(1)
        # Stop once no transition probability exceeds the threshold gamma.
        if M.max() <= gamma:
            break
        # Merge every token pair whose transition probability exceeds gamma.
        merges += [(id_to_word[i1], id_to_word[i2])
                   for i1, i2 in zip(*(M > gamma).nonzero())]
        words = apply_re(words, merges)
    return merges

def apply_re(words: list, merges: list):
    """Apply merge rules; "@@" is the BPE subword-continuation marker."""
    for merge in merges:
        i = 0
        while i < len(words) - 1:
            if tuple(words[i:i + len(merge)]) == merge:
                # Join the pair into one token and drop the continuation marker
                # (the exact join/replace call in the extracted listing was garbled).
                words[i:i + len(merge)] = ["".join(merge).replace("@@", "")]
                continue  # stay at i so the merged token can merge again
            i += 1
    return words
Open Source Code | Yes | "The source code of this paper can be obtained from https://github.com/fuzihaofzh/repetition-problem-nlg."
Open Datasets | Yes | "We adopt the widely used IWSLT 14 English-German dataset containing 160K sentence pairs. ... We use the Wiki-103 dataset (Merity et al. 2017) and encode the text with byte pair encoding with subword units around 10,000." (A hedged BPE sketch follows this table.)
Dataset Splits | No | The paper uses widely known datasets (IWSLT 14, Wiki-103) that have standard splits, and it refers to Appendix A.6 for hyper-parameters, but the main text does not explicitly state the training, validation, or test splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions Python code for Algorithm 1 and refers to the Transformer architecture and fairseq, but it does not give version numbers for any software dependency (e.g., Python, PyTorch, TensorFlow, fairseq, CUDA).
Experiment Setup | No | The paper states: "The details of hyper-parameters settings are presented in Appendix A.6 (Fu et al. 2020b)." That appendix is not included in the analyzed text, so the specific experimental setup details are not available here.
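
As a toy illustration of the reconstructed Algorithm 1 above: a minimal sketch in which the corpus and the values N=1 and gamma=0.5 are our own illustrative choices, not taken from the paper.

# Toy corpus: "a@@" is always followed by "b" (transition probability 1.0),
# so that bigram exceeds gamma = 0.5 and becomes a merge rule.
corpus = ["a@@", "b", "c", "a@@", "b", "a@@", "b"]
merges = learn_re(list(corpus), N=1, gamma=0.5)
assert ("a@@", "b") in merges  # ("c", "a@@") also fires; order varies with set()
# Applying just that rule collapses each "a@@ b" pair into the single token "ab":
print(apply_re(list(corpus), [("a@@", "b")]))  # ['ab', 'c', 'ab', 'ab']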
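
For the "byte pair encoding with subword units around 10,000" quoted in the Open Datasets row, the paper does not name a BPE implementation; a minimal sketch assuming the widely used subword-nmt package (the file names are placeholders, not from the paper) might look like this:

import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn roughly 10,000 merge operations from raw training text.
# "train.txt" and "codes.bpe" are placeholder file names.
with codecs.open("train.txt", encoding="utf-8") as infile, \
     codecs.open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Segment text into subword units; continuation pieces carry the "@@"
# marker, the same marker Algorithm 1's apply_re strips after merging.
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("the repetition problem in text generation"))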