Avoiding Plagiarism in Markov Sequence Generation

Authors: Alexandre Papadopoulos, Pierre Roy, François Pachet

AAAI 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We applied our algorithm on the Eugene Onegin corpus (2160 sentences, 6919 unique words, 32719 words in total). An interesting question is how likely a stochastic method finds sequences satisfying the maximum order property. As we mentioned at the beginning of the paper, it has been widely observed that high order Markov models tend to replicate large chunks of the corpus. Our model enables us to make exact solution counting (as a consequence of having arc-consistency). We first count the total number S of Markovian sequences of length n = 20, with a Markov order 3, based on the Eugene Onegin corpus. We compare this to the number S_L of Markovian sequences with a maximum order of L, with L ranging from 5 (the minimum non-trivially infeasible value) to 21 (the minimum value for which MAXORDER is trivially satisfied). We call solution loss the ratio 1 - (S_L / S): the closer it is to 1, the more Markovian sequences are ruled out because they do not satisfy the maximum order property for L. We show the results on Figure 3. Naturally, the constraint is tighter for low values of L. For L = 5 for example, there is no solution, leading to a solution loss of 1. For bigger values, the solution loss is still close to 1, with less than 1% solutions left." (A sketch of the maximum order check and the solution-loss ratio follows the table.)
Researcher Affiliation | Collaboration | (1) Sony CSL, 6 rue Amyot, 75005, Paris, France; (2) Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France
Pseudocode | Yes | "Algorithm 1: Markov automaton"
Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it explicitly state that the code for its methodology is made publicly available.
Open Datasets | Yes | "We applied our algorithm on the Eugene Onegin corpus (2160 sentences, 6919 unique words, 32719 words in total). An interesting question is how likely a stochastic method finds sequences satisfying the maximum order property. As we mentioned at the beginning of the paper, it has been widely observed that high order Markov models tend to replicate large chunks of the corpus."
Dataset Splits | No | The paper refers to a "training phase" where probabilities are estimated from the corpus, but it does not specify any training/validation/test splits or cross-validation procedure for the corpus itself. The evaluation compares generated sequences against properties of the corpus.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, cluster specifications) used to run the experiments.
Software Dependencies | No | The paper mentions conceptual algorithms and constraints (e.g., the REGULAR constraint, the Aho and Corasick string-matching algorithm, a decomposition-based method) and cites related work, but it does not provide specific version numbers for any software libraries, frameworks, or solvers used in the implementation.
Experiment Setup | Yes | "We use static variable ordering heuristics, in their temporal order. For a given variable X_i, we consider up to k previous variables. We then consider all the continuations in the corpus for the sequence X_{i-k}, ..., X_{i-1} that are arc-consistent for MAXORDER. If the number of such continuations is above a certain fixed threshold, we choose one, using the transition probabilities. Otherwise, the procedure is repeated for a smaller k, until one value is found. We start with k = L - 1. We applied our algorithm on the Eugene Onegin corpus (2160 sentences, 6919 unique words, 32719 words in total). [...] n = 20, with a Markov order 3, based on the Eugene Onegin corpus. We compare this to the number S_L of Markovian sequences with a maximum order of L, with L ranging from 5 (the minimum non-trivially infeasible value) to 21..." (A sketch of this sampling heuristic follows the table.)
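
The solution-loss measurement quoted under Research Type counts Markovian sequences that contain no chunk of L consecutive words copied verbatim from the corpus, and reports the ratio 1 - (S_L / S). The sketch below only illustrates that property and that ratio: the function names and the brute-force scan over L-grams are our own illustration, whereas the paper counts solutions exactly through an arc-consistent constraint model (MAXORDER) rather than by enumeration.

    def max_order_violations(sequence, corpus_words, L):
        # Collect every run of L consecutive words that occurs in the corpus.
        corpus_lgrams = {
            tuple(corpus_words[i:i + L])
            for i in range(len(corpus_words) - L + 1)
        }
        # A sequence satisfies the maximum order property for L when no such
        # run is reproduced verbatim, i.e. the returned list is empty.
        return [
            tuple(sequence[i:i + L])
            for i in range(len(sequence) - L + 1)
            if tuple(sequence[i:i + L]) in corpus_lgrams
        ]

    def solution_loss(total_markovian, max_order_markovian):
        # Solution loss = 1 - (S_L / S), as defined in the quoted passage.
        return 1.0 - max_order_markovian / total_markovian

For instance, the quoted experiment reports no solutions at L = 5, so S_5 = 0 and solution_loss(S, 0) is 1.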
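
The Experiment Setup row describes a greedy random walk that tries the longest context first and falls back to shorter contexts when too few admissible continuations remain. The sketch below reproduces that control flow under simplifying assumptions: continuation tables are built by plain n-gram counting, the "arc-consistent for MAXORDER" test is replaced by a direct check that the new word does not complete a copied chunk of L corpus words, and dead ends simply fail instead of being prevented by the paper's constraint propagation.

    import random
    from collections import defaultdict

    def build_continuations(corpus_words, max_context):
        # Map every context of 1..max_context consecutive words to the words
        # that follow it in the corpus (repetitions encode frequencies).
        table = defaultdict(list)
        for k in range(1, max_context + 1):
            for i in range(len(corpus_words) - k):
                table[tuple(corpus_words[i:i + k])].append(corpus_words[i + k])
        return table

    def sample_sequence(corpus_words, n, L, threshold=3, seed=None):
        rng = random.Random(seed)
        table = build_continuations(corpus_words, L - 1)
        # Chunks of L consecutive corpus words that must never be reproduced.
        forbidden = {tuple(corpus_words[i:i + L])
                     for i in range(len(corpus_words) - L + 1)}
        seq = [rng.choice(corpus_words)]
        while len(seq) < n:
            chosen = None
            # Start with the longest context k = L - 1, then shrink it.
            for k in range(min(L - 1, len(seq)), 0, -1):
                candidates = [
                    w for w in table.get(tuple(seq[-k:]), [])
                    if len(seq) < L - 1
                    or tuple(seq[-(L - 1):]) + (w,) not in forbidden
                ]
                # Accept if enough continuations survive (or, at k = 1, as soon
                # as any value is found); sampling from the raw occurrence list
                # follows the corpus transition probabilities.
                if candidates and (len(candidates) >= threshold or k == 1):
                    chosen = rng.choice(candidates)
                    break
            if chosen is None:
                return None  # dead end; the paper's solver would not fail here
            seq.append(chosen)
        return seq

With a tokenized corpus in words, a call such as sample_sequence(words, n=20, L=10) attempts to generate a 20-word sequence that copies no run of 10 or more consecutive words from the corpus; the parameter names and the threshold default are illustrative, not taken from the paper.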