Avoiding Plagiarism in Markov Sequence Generation

Authors: Alexandre Papadopoulos, Pierre Roy, François Pachet

AAAI 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We applied our algorithm on the Eugene Onegin corpus (2160 sentences, 6919 unique words, 32719 words in total). An interesting question is how likely a stochastic method finds sequences satisfying the maximum order property. As we mentioned at the beginning of the paper, it has been widely observed that high order Markov models tend to replicate large chunks of the corpus. Our model enables us to make exact solution counting (as a consequence of having arc-consistency). We first count the total number S of Markovian sequences of length n = 20, with a Markov order 3, based on the Eugene Onegin corpus. We compare this to the number S_L of Markovian sequences with a maximum order of L, with L ranging from 5 (the minimum non-trivially infeasible value) to 21 (the minimum value for which MAXORDER is trivially satisfied). We call solution loss the ratio 1 - (S_L / S): the closer it is to 1, the more Markovian sequences are ruled out because they do not satisfy the maximum order property for L. We show the results on Figure 3. Naturally, the constraint is tighter for low values of L. For L = 5 for example, there is no solution, leading to a solution loss of 1. For bigger values, the solution loss is still close to 1, with less than 1% solutions left." (A sketch of the maximum order check and the solution-loss ratio follows the table.)
Researcher Affiliation | Collaboration | (1) Sony CSL, 6 rue Amyot, 75005, Paris, France; (2) Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France
Pseudocode | Yes | "Algorithm 1: Markov automaton"
Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it explicitly state that the code for its methodology is made publicly available.
Open Datasets | Yes | "We applied our algorithm on the Eugene Onegin corpus (2160 sentences, 6919 unique words, 32719 words in total). An interesting question is how likely a stochastic method finds sequences satisfying the maximum order property. As we mentioned at the beginning of the paper, it has been widely observed that high order Markov models tend to replicate large chunks of the corpus."
Dataset Splits | No | The paper refers to a "training phase" where probabilities are estimated from the corpus, but it does not specify any training/validation/test splits or cross-validation procedure for the corpus itself. The evaluation compares generated sequences against properties of the corpus.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, cluster specifications) used to run the experiments.
Software Dependencies | No | The paper mentions conceptual algorithms and constraints (e.g., the REGULAR constraint, the Aho and Corasick string-matching algorithm, a decomposition-based method) and cites related work, but it does not provide specific version numbers for any software libraries, frameworks, or solvers used in the implementation.
Experiment Setup | Yes | "We use static variable ordering heuristics, in their temporal order. For a given variable X_i, we consider up to k previous variables. We then consider all the continuations in the corpus for the sequence X_{i-k}, ..., X_{i-1} that are arc-consistent for MAXORDER. If the number of such continuations is above a certain fixed threshold, we choose one, using the transition probabilities. Otherwise, the procedure is repeated for a smaller k, until one value is found. We start with k = L - 1. We applied our algorithm on the Eugene Onegin corpus (2160 sentences, 6919 unique words, 32719 words in total). [...] n = 20, with a Markov order 3, based on the Eugene Onegin corpus. We compare this to the number S_L of Markovian sequences with a maximum order of L, with L ranging from 5 (the minimum non-trivially infeasible value) to 21..." (A sketch of this sampling heuristic follows the table.)
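
The solution-loss measurement quoted under Research Type counts Markovian sequences that contain no chunk of L consecutive words copied verbatim from the corpus, and reports the ratio 1 - (S_L / S). The sketch below only illustrates that property and that ratio: the function names and the brute-force scan over L-grams are our own illustration, whereas the paper counts solutions exactly through an arc-consistent constraint model (MAXORDER) rather than by enumeration.

    def max_order_violations(sequence, corpus_words, L):
        # Collect every run of L consecutive words that occurs in the corpus.
        corpus_lgrams = {
            tuple(corpus_words[i:i + L])
            for i in range(len(corpus_words) - L + 1)
        }
        # A sequence satisfies the maximum order property for L when no such
        # run is reproduced verbatim, i.e. the returned list is empty.
        return [
            tuple(sequence[i:i + L])
            for i in range(len(sequence) - L + 1)
            if tuple(sequence[i:i + L]) in corpus_lgrams
        ]

    def solution_loss(total_markovian, max_order_markovian):
        # Solution loss = 1 - (S_L / S), as defined in the quoted passage.
        return 1.0 - max_order_markovian / total_markovian

For instance, the quoted experiment reports no solutions at L = 5, so S_5 = 0 and solution_loss(S, 0) is 1.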
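
The Experiment Setup row describes a greedy random walk that tries the longest context first and falls back to shorter contexts when too few admissible continuations remain. The sketch below reproduces that control flow under simplifying assumptions: continuation tables are built by plain n-gram counting, the "arc-consistent for MAXORDER" test is replaced by a direct check that the new word does not complete a copied chunk of L corpus words, and dead ends simply fail instead of being prevented by the paper's constraint propagation.

    import random
    from collections import defaultdict

    def build_continuations(corpus_words, max_context):
        # Map every context of 1..max_context consecutive words to the words
        # that follow it in the corpus (repetitions encode frequencies).
        table = defaultdict(list)
        for k in range(1, max_context + 1):
            for i in range(len(corpus_words) - k):
                table[tuple(corpus_words[i:i + k])].append(corpus_words[i + k])
        return table

    def sample_sequence(corpus_words, n, L, threshold=3, seed=None):
        rng = random.Random(seed)
        table = build_continuations(corpus_words, L - 1)
        # Chunks of L consecutive corpus words that must never be reproduced.
        forbidden = {tuple(corpus_words[i:i + L])
                     for i in range(len(corpus_words) - L + 1)}
        seq = [rng.choice(corpus_words)]
        while len(seq) < n:
            chosen = None
            # Start with the longest context k = L - 1, then shrink it.
            for k in range(min(L - 1, len(seq)), 0, -1):
                candidates = [
                    w for w in table.get(tuple(seq[-k:]), [])
                    if len(seq) < L - 1
                    or tuple(seq[-(L - 1):]) + (w,) not in forbidden
                ]
                # Accept if enough continuations survive (or, at k = 1, as soon
                # as any value is found); sampling from the raw occurrence list
                # follows the corpus transition probabilities.
                if candidates and (len(candidates) >= threshold or k == 1):
                    chosen = rng.choice(candidates)
                    break
            if chosen is None:
                return None  # dead end; the paper's solver would not fail here
            seq.append(chosen)
        return seq

With a tokenized corpus in words, a call such as sample_sequence(words, n=20, L=10) attempts to generate a 20-word sequence that copies no run of 10 or more consecutive words from the corpus; the parameter names and the threshold default are illustrative, not taken from the paper.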