Dealing with small data: On the generalization of context trees

Authors: Ralf Eggeling, Mikko Koivisto, Ivo Grosse

ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "By empirical studies both on simulated and real-world data, we demonstrate that the synergy of combining both orthogonal approaches yields a substantial breakthrough in obtaining statistically efficient and computationally feasible generalizations of CTs." "In Section 6 we evaluate the effect of the ideas on running time using both artificial and real data, and study the statistical efficiency of different CT generalizations for real data."
Researcher Affiliation | Academia | Ralf Eggeling (EGGELING@INFORMATIK.UNI-HALLE.DE), Martin Luther University Halle-Wittenberg, Germany; Mikko Koivisto (MIKKO.KOIVISTO@CS.HELSINKI.FI), Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Finland; Ivo Grosse (GROSSE@INFORMATIK.UNI-HALLE.DE), Martin Luther University Halle-Wittenberg, Germany, and German Center for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
Pseudocode | No | The paper describes algorithms (e.g., dynamic programming) but does not include structured pseudocode or a clearly labeled algorithm block.
Open Source Code | No | "We have implemented the presented algorithms in Java based on the Jstacs framework (Grau et al., 2012) and conducted the experiments on a server with 2.4 GHz cores." This sentence indicates that an implementation exists but does not provide concrete access to the source code developed for the paper's methodology.
Open Datasets | Yes | "We extract a data set of CTCF binding sites, which consists of 908 DNA sequences, from the Jaspar database (Sandelin et al., 2004)... and four additional data sets from the JASPAR database (Sandelin et al., 2004), namely DAF-12 from C. elegans, BZR1 and PIL5 from A. thaliana, and human NR2C2." "In this study we use the alphabet reduction method of Li et al. (2003), since it offers for each possible reduced alphabet size an optimal clustering of amino acids into groups, and study several well-known proteins of different size and functionality, extracted from the protein sequence database UniProt (The UniProt Consortium, 2013)."
Dataset Splits | Yes | "For all data sets and all structural variants, we compare the prediction performance using leave-one-out cross validation (Table 2)."
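The leave-one-out protocol quoted above can be sketched as follows. This is a generic Python illustration, not the authors' Java/Jstacs code; the `fit` and `score` callables are placeholders standing in for learning a context-tree model and computing a held-out log-likelihood.

```python
# Minimal leave-one-out cross-validation sketch (illustrative only; the
# paper's actual models are context trees / PMMs trained with Jstacs in Java).

def leave_one_out(dataset, fit, score):
    """For each item, train on the remaining N-1 items and score the
    held-out one; return the mean held-out score."""
    total = 0.0
    for i in range(len(dataset)):
        train = dataset[:i] + dataset[i + 1:]  # all but the i-th item
        model = fit(train)                     # e.g. learn a context tree
        total += score(model, dataset[i])      # e.g. held-out log-likelihood
    return total / len(dataset)

# Toy stand-ins: the "model" is the training-set mean, the score is the
# negative squared error on the held-out point.
data = [1.0, 2.0, 3.0, 4.0]
fit = lambda xs: sum(xs) / len(xs)
score = lambda m, x: -(m - x) ** 2
result = leave_one_out(data, fit, score)
```

With 908 CTCF binding sites, this protocol trains 908 models, each evaluated on the single sequence it did not see.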
Hardware Specification | No | "We have implemented the presented algorithms in Java based on the Jstacs framework (Grau et al., 2012) and conducted the experiments on a server with 2.4 GHz cores." This gives a general description but lacks specific hardware details such as the CPU model, GPU, or memory.
Software Dependencies | No | "We have implemented the presented algorithms in Java based on the Jstacs framework (Grau et al., 2012) and conducted the experiments on a server with 2.4 GHz cores." This names the language and a framework but gives no specific version numbers.
Experiment Setup | Yes | "For learning inhomogeneous PMMs, which make a position-specific use of context trees, we use the best reported learning method in Eggeling et al. (2014b), that is, BIC (Schwarz, 1978) as structure score and fsNML (Silander et al., 2009) as parameter estimation method." "For modeling DNA binding sites (d = 7, all values d < 7 are shown in Supplement)."
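The role of BIC as a structure score can be illustrated with a simpler relative of context trees: choosing the order d of a fixed-order Markov model by penalized likelihood. This is a generic sketch of BIC-based model selection, not the paper's context-tree/fsNML implementation; `bic_markov` and the toy sequences are made up for illustration.

```python
import math
from collections import Counter

# Illustrative BIC score for a fixed-order Markov model over the DNA alphabet.
# BIC = max log-likelihood - (free parameters / 2) * log(sample size).

ALPHABET = "ACGT"

def bic_markov(seqs, d):
    ctx_counts = Counter()  # counts of (context, next symbol) pairs
    ctx_totals = Counter()  # counts of each context
    n = 0                   # number of scored symbols
    for s in seqs:
        for i in range(d, len(s)):
            ctx = s[i - d:i]
            ctx_counts[(ctx, s[i])] += 1
            ctx_totals[ctx] += 1
            n += 1
    # Maximum-likelihood log-probability of the data given the structure.
    loglik = sum(c * math.log(c / ctx_totals[ctx])
                 for (ctx, _), c in ctx_counts.items())
    # One free distribution per context, |A|-1 free parameters each.
    k = (len(ALPHABET) ** d) * (len(ALPHABET) - 1)
    return loglik - 0.5 * k * math.log(n)

# Pick the order with the highest BIC among candidates 0..2.
seqs = ["ACGTACGTACGT", "ACGTACGT"]
best = max(range(3), key=lambda d: bic_markov(seqs, d))
```

A context tree refines this idea by letting the context length vary per context, so the penalty term counts only the leaves actually kept; the trade-off between fit and parameter count is the same.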