Learning to Rap Battle with Bilingual Recursive Neural Networks

Authors: Dekai Wu, Karteek Addanki

IJCAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We used freely available user-generated hip hop lyrics on the Internet to provide training data for our experiments. The processed corpus contained 22 million tokens, with 260,000 verses and 2.7 million lines of hip hop lyrics. As human evaluation is expensive, a small subset of 85 lines was chosen as the test set to provide challenges for comparing the quality of responses generated by different systems. We followed the evaluation scheme proposed by Addanki and Wu [2014], as it achieved very encouraging inter-evaluator agreement despite the high degree of subjectivity of the evaluation task. The output of both the baseline and our model was given to three independent frequent hip hop listeners familiar with freestyle rap battling for manual evaluation. They were asked to evaluate the system outputs according to fluency and the degree of rhyming. They were free to choose the tune to make the lyrics rhyme, as the beats of the song were not used in the training data. Each evaluator was asked to score the response of each system on the criteria of fluency and rhyming as good, acceptable, or bad. Table 1 shows the average fraction of sentences rated good and acceptable for each model. Compared to the phrase-based SMT (PBSMT) baseline, our TRAAM model produces a significantly higher percentage of good and acceptable rhyming responses.
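The "average fraction of sentences rated good and acceptable" metric described above can be sketched as follows. The ratings dictionary is a toy stand-in, since the actual per-evaluator judgments are not published:

```python
from collections import Counter

# Toy ratings: one label per (evaluator, response) pair; illustrative only,
# not the actual evaluation data from the paper.
ratings = {
    "evaluator_1": ["good", "acceptable", "bad", "good"],
    "evaluator_2": ["acceptable", "good", "bad", "acceptable"],
    "evaluator_3": ["good", "good", "acceptable", "bad"],
}

def fraction_rated(ratings, labels):
    """Average, over evaluators, of the fraction of responses
    whose rating falls in `labels`."""
    fracs = []
    for scores in ratings.values():
        counts = Counter(scores)
        fracs.append(sum(counts[label] for label in labels) / len(scores))
    return sum(fracs) / len(fracs)

good_or_acceptable = fraction_rated(ratings, {"good", "acceptable"})
```

With three evaluators each rating the same set of responses, this yields the per-model numbers reported in the paper's Table 1.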
Researcher Affiliation Academia Dekai Wu and Karteek Addanki, Human Language Technology Center, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, {dekai, vskaddanki}@cs.ust.hk
Pseudocode No The paper describes algorithms using mathematical equations and prose (e.g., Training algorithm, Decoding algorithm), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not include any statements about releasing source code or provide links to a code repository for the described methodology.
Open Datasets No The paper states: 'We used freely available user generated hip hop lyrics on the Internet to provide training data for our experiments.' and 'The processed corpus contained 22 million tokens with 260,000 verses and 2.7 million lines of hip hop lyrics.' It does not provide a specific link, DOI, or formal citation (with authors and year for the dataset itself) for accessing this corpus, only mentioning its origin from 'the Internet'.
Dataset Splits No The paper mentions a training corpus ('around 200,000 lines of challenge response pairs') and a test set ('a small subset of 85 lines'), but does not explicitly provide details about a distinct validation set or the specific percentages/counts for training, validation, and test splits. It states: 'The weights of the feature scores were determined empirically observing the performance on a small subset of the test data.'
Hardware Specification No The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies No The paper mentions the use of 'SRILM [Stolcke, 2002]' and 'Moses baseline [Koehn et al., 2007]', but it does not specify version numbers for these or any other software dependencies, which is required for reproducibility.
Experiment Setup Yes We use a single layer with a nonlinear activation function (tanh) similar to the monolingual recursive autoencoder [Socher et al., 2011]...The loss function is defined as a linear combination (with the linear weighting factor α) of the L2 norm of the reconstruction error of the children and the cross-entropy loss of reconstructing the permutation order...A regularization parameter λ is used on the norm of the model parameters θ, to avoid overfitting...L-BFGS algorithm is used in order to minimize the loss function. The feature score along with the transduction grammar and LM score is used to score each hypothesis using a weighted linear combination.
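The quoted setup describes, per node, a single tanh layer that composes two child embeddings, an L2 reconstruction error on the children, a cross-entropy loss on the predicted permutation order, an α-weighted combination of the two, and λ-regularization of the parameters. A minimal sketch of that per-node objective follows; all dimensions, parameter names, and hyperparameter values are illustrative assumptions (the paper optimizes the full objective over the corpus with L-BFGS, not single nodes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters (not specified in the review text).
d = 8                                           # embedding dimension
W = rng.normal(scale=0.1, size=(d, 2 * d))      # composition weights
b = np.zeros(d)
U = rng.normal(scale=0.1, size=(2 * d, d))      # reconstruction weights
c = np.zeros(2 * d)
V = rng.normal(scale=0.1, size=(2, d))          # permutation-order classifier
alpha, lam = 0.2, 1e-4                          # illustrative alpha and lambda

def node_loss(x1, x2, order):
    """Per-node loss: alpha-weighted L2 reconstruction error of the
    children plus cross-entropy of reconstructing the permutation
    order, with lambda-regularization of the parameters."""
    children = np.concatenate([x1, x2])
    p = np.tanh(W @ children + b)                # parent embedding (tanh layer)
    recon = U @ p + c                            # reconstruct both children
    rec_err = np.sum((recon - children) ** 2)    # L2 reconstruction error
    logits = V @ p                               # predict permutation order
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    xent = -np.log(probs[order])                 # cross-entropy on the order
    reg = lam * sum(np.sum(m ** 2) for m in (W, U, V))
    return alpha * rec_err + (1 - alpha) * xent + reg
```

In training, gradients of this loss with respect to θ = (W, b, U, c, V) would be fed to an L-BFGS optimizer, as the paper describes.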