Dealing with small data: On the generalization of context trees

Authors: Ralf Eggeling, Mikko Koivisto, Ivo Grosse

ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "By empirical studies both on simulated and real-world data, we demonstrate that the synergy of combining both orthogonal approaches yields a substantial breakthrough in obtaining statistically efficient and computationally feasible generalizations of CTs." "In Section 6 we evaluate the effect of the ideas on running time using both artificial and real data, and study the statistical efficiency of different CT generalizations for real data."
Researcher Affiliation | Academia | Ralf Eggeling (EGGELING@INFORMATIK.UNI-HALLE.DE), Martin Luther University Halle-Wittenberg, Germany; Mikko Koivisto (MIKKO.KOIVISTO@CS.HELSINKI.FI), Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Finland; Ivo Grosse (GROSSE@INFORMATIK.UNI-HALLE.DE), Martin Luther University Halle-Wittenberg, Germany, and German Center for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
Pseudocode | No | The paper describes algorithms (e.g., dynamic programming) but does not include structured pseudocode or a clearly labeled algorithm block.
Open Source Code | No | "We have implemented the presented algorithms in Java based on the Jstacs framework (Grau et al., 2012) and conducted the experiments on a server with 2.4 GHz cores." This sentence indicates that an implementation exists but does not provide concrete access to the source code developed for the paper's methodology.
Open Datasets | Yes | "We extract a data set of CTCF binding sites, which consists of 908 DNA sequences, from the Jaspar database (Sandelin et al., 2004)... and four additional data sets from the JASPAR database (Sandelin et al., 2004), namely DAF-12 from C. elegans, BZR1 and PIL5 from A. thaliana, and human NR2C2." "In this study we use the alphabet reduction method of Li et al. (2003), since it offers for each possible reduced alphabet size an optimal clustering of amino acids into groups, and study several well-known proteins of different size and functionality, extracted from the protein sequence database UniProt (The UniProt Consortium, 2013)."
Dataset Splits | Yes | "For all data sets and all structural variants, we compare the prediction performance using leave-one-out cross validation (Table 2)."
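The leave-one-out protocol quoted above can be sketched as follows. This is a generic Python illustration, not the authors' Java/Jstacs code; the `fit` and `score` callables are placeholders standing in for learning a context-tree model and computing a held-out log-likelihood.

```python
# Minimal leave-one-out cross-validation sketch (illustrative only; the
# paper's actual models are context trees / PMMs trained with Jstacs in Java).

def leave_one_out(dataset, fit, score):
    """For each item, train on the remaining N-1 items and score the
    held-out one; return the mean held-out score."""
    total = 0.0
    for i in range(len(dataset)):
        train = dataset[:i] + dataset[i + 1:]  # all but the i-th item
        model = fit(train)                     # e.g. learn a context tree
        total += score(model, dataset[i])      # e.g. held-out log-likelihood
    return total / len(dataset)

# Toy stand-ins: the "model" is the training-set mean, the score is the
# negative squared error on the held-out point.
data = [1.0, 2.0, 3.0, 4.0]
fit = lambda xs: sum(xs) / len(xs)
score = lambda m, x: -(m - x) ** 2
result = leave_one_out(data, fit, score)
```

With 908 CTCF binding sites, this protocol trains 908 models, each evaluated on the single sequence it did not see.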
Hardware Specification | No | "We have implemented the presented algorithms in Java based on the Jstacs framework (Grau et al., 2012) and conducted the experiments on a server with 2.4 GHz cores." This gives a general description but lacks specific hardware details such as the CPU model, GPU, or memory.
Software Dependencies | No | "We have implemented the presented algorithms in Java based on the Jstacs framework (Grau et al., 2012) and conducted the experiments on a server with 2.4 GHz cores." This names the language and a framework but gives no specific version numbers.
Experiment Setup | Yes | "For learning inhomogeneous PMMs, which make a position-specific use of context trees, we use the best reported learning method in Eggeling et al. (2014b), that is, BIC (Schwarz, 1978) as structure score and fsNML (Silander et al., 2009) as parameter estimation method." "For modeling DNA binding sites (d = 7, all values d < 7 are shown in Supplement)."
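The role of BIC as a structure score can be illustrated with a simpler relative of context trees: choosing the order d of a fixed-order Markov model by penalized likelihood. This is a generic sketch of BIC-based model selection, not the paper's context-tree/fsNML implementation; `bic_markov` and the toy sequences are made up for illustration.

```python
import math
from collections import Counter

# Illustrative BIC score for a fixed-order Markov model over the DNA alphabet.
# BIC = max log-likelihood - (free parameters / 2) * log(sample size).

ALPHABET = "ACGT"

def bic_markov(seqs, d):
    ctx_counts = Counter()  # counts of (context, next symbol) pairs
    ctx_totals = Counter()  # counts of each context
    n = 0                   # number of scored symbols
    for s in seqs:
        for i in range(d, len(s)):
            ctx = s[i - d:i]
            ctx_counts[(ctx, s[i])] += 1
            ctx_totals[ctx] += 1
            n += 1
    # Maximum-likelihood log-probability of the data given the structure.
    loglik = sum(c * math.log(c / ctx_totals[ctx])
                 for (ctx, _), c in ctx_counts.items())
    # One free distribution per context, |A|-1 free parameters each.
    k = (len(ALPHABET) ** d) * (len(ALPHABET) - 1)
    return loglik - 0.5 * k * math.log(n)

# Pick the order with the highest BIC among candidates 0..2.
seqs = ["ACGTACGTACGT", "ACGTACGT"]
best = max(range(3), key=lambda d: bic_markov(seqs, d))
```

A context tree refines this idea by letting the context length vary per context, so the penalty term counts only the leaves actually kept; the trade-off between fit and parameter count is the same.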