Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning

Authors: Dan Garrette, Chris Dyer, Jason Baldridge, Noah Smith

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our approach on the three available CCG corpora: English CCGBank (Hockenmaier and Steedman 2007), Chinese Treebank CCG (Tse and Curran 2010), and the Italian CCG-TUT corpus (Bos, Bosco, and Mazzei 2009). Each corpus was split into four non-overlapping datasets: a portion for constructing the tag dictionary, sentences for the unlabeled training data, development trees (used for tuning the α, p_term, p_mod, and p_fwd hyperparameters), and test trees. We used the same splits as Garrette et al. (2014).
Researcher Affiliation | Academia | Department of Computer Science, University of Texas at Austin, dhg@cs.utexas.edu; School of Computer Science, Carnegie Mellon University, {cdyer,nasmith}@cs.cmu.edu; Department of Linguistics, University of Texas at Austin, jbaldrid@utexas.edu
Pseudocode | Yes | Borrowing from the recursive generative function notation of Johnson, Griffiths, and Goldwater (2007), our process can be summarized as:

Parameters:
  σ ∼ Dirichlet(α_σ, σ_0)                   root categories
  θ_t ∼ Dirichlet(α_θ, θ_0)     ∀ t ∈ T     binary productions
  π_t ∼ Dirichlet(α_π, π_0)     ∀ t ∈ T     unary productions
  µ_t ∼ Dirichlet(α_µ, µ_0^t)   ∀ t ∈ T     terminal productions
  λ_t ∼ Dir(⟨1, 1, 1⟩)          ∀ t ∈ T     production mixture

Sentence:
  s ∼ Categorical(σ)
  generate(s)

where function generate(t):
  z ∼ Categorical(λ_t)
  if z = 1:  ⟨u, v⟩ | t ∼ Categorical(θ_t);  Tree(t, generate(u), generate(v))
  if z = 2:  u | t ∼ Categorical(π_t);       Tree(t, generate(u))
  if z = 3:  w | t ∼ Categorical(µ_t);       Leaf(t, w)
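For concreteness, below is a minimal Python sketch of that generative story. The toy category set, vocabulary, and uniform base measures are placeholder assumptions (the paper uses grammar-informed priors over CCG categories), and a depth cap is added so sampling always terminates; the α values mirror those reported in the Experiment Setup row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy category set standing in for the CCG category set T.
CATS = ["S", "NP", "N", "(S\\NP)", "(NP/N)"]
VOCAB = ["the", "dog", "barks", "sees", "a", "cat"]

def draw_dirichlet(alpha, base):
    """Draw a categorical parameter vector from Dirichlet(alpha * base)."""
    return rng.dirichlet(alpha * np.asarray(base))

# Uniform placeholder base measures for sigma_0, theta_0, pi_0, mu_0^t.
n_cats, n_words = len(CATS), len(VOCAB)
sigma = draw_dirichlet(1.0, np.ones(n_cats) / n_cats)                       # root categories
theta = {t: draw_dirichlet(100.0, np.ones(n_cats**2) / n_cats**2)           # binary productions
         for t in CATS}
pi    = {t: draw_dirichlet(10_000.0, np.ones(n_cats) / n_cats)              # unary productions
         for t in CATS}
mu    = {t: draw_dirichlet(10_000.0, np.ones(n_words) / n_words)            # terminal productions
         for t in CATS}
lam   = {t: rng.dirichlet([1.0, 1.0, 1.0]) for t in CATS}                   # production mixture

def generate(t, depth=0, max_depth=6):
    """Recursively expand category t; z = 0/1/2 here plays the role of z = 1/2/3 above."""
    z = rng.choice(3, p=lam[t])
    if depth >= max_depth or z == 2:                 # terminal production: emit a word
        w = VOCAB[rng.choice(n_words, p=mu[t])]
        return ("Leaf", t, w)
    if z == 0:                                       # binary production: pick <u, v>
        idx = rng.choice(n_cats**2, p=theta[t])
        u, v = CATS[idx // n_cats], CATS[idx % n_cats]
        return ("Tree", t, generate(u, depth + 1), generate(v, depth + 1))
    u = CATS[rng.choice(n_cats, p=pi[t])]            # unary production: pick u
    return ("Tree", t, generate(u, depth + 1))

root = CATS[rng.choice(n_cats, p=sigma)]             # s ~ Categorical(sigma)
print(generate(root))
```

Each nonterminal first draws its production type from λ_t and then expands via the corresponding distribution, mirroring the three cases of the recursive function above.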
Open Source Code | No | The paper does not provide any explicit statement about, or link to, open-source code for the methodology described.
Open Datasets | Yes | We evaluated our approach on the three available CCG corpora: English CCGBank (Hockenmaier and Steedman 2007), Chinese Treebank CCG (Tse and Curran 2010), and the Italian CCG-TUT corpus (Bos, Bosco, and Mazzei 2009).
Dataset Splits | Yes | Each corpus was split into four non-overlapping datasets: a portion for constructing the tag dictionary, sentences for the unlabeled training data, development trees (used for tuning the α, p_term, p_mod, and p_fwd hyperparameters), and test trees. We used the same splits as Garrette et al. (2014).
Hardware Specification | No | The acknowledgments state: "Experiments were run on the UTCS Mastodon Cluster, provided by NSF grant EIA-0303609." While a specific cluster is named, no details about its CPU, GPU, memory, or other hardware components are provided.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | For the category grammar, we used p_term=0.7, p_mod=0.1, p_fwd=0.5. For the priors, we use α_σ=1, α_θ=100, α_π=10,000, α_µ=10,000.
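To illustrate how p_term, p_mod, and p_fwd could enter a recursive prior over CCG categories, here is a hedged Python sketch: p_term governs stopping at an atomic category, p_fwd the choice between a forward and backward slash, and p_mod the chance of a modifier category (result equal to argument). The Atom/Slash classes, the atomic distribution ATOM_PRIOR, and the exact combination of factors are illustrative assumptions, not necessarily the paper's definition of the category prior.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical category representation for illustration only.
@dataclass(frozen=True)
class Atom:
    name: str                      # e.g. "S", "NP", "N"

@dataclass(frozen=True)
class Slash:
    result: "Cat"
    arg: "Cat"
    forward: bool                  # True for A/B, False for A\B

Cat = Union[Atom, Slash]

P_TERM, P_MOD, P_FWD = 0.7, 0.1, 0.5                        # values reported in the paper
ATOM_PRIOR = {"S": 0.4, "NP": 0.3, "N": 0.2, "PP": 0.1}     # illustrative atomic distribution

def cat_prior(c: Cat) -> float:
    """Recursive prior over CCG categories (illustrative form, not the paper's exact one)."""
    if isinstance(c, Atom):
        return P_TERM * ATOM_PRIOR.get(c.name, 0.0)
    slash_p = P_FWD if c.forward else (1.0 - P_FWD)
    if c.result == c.arg:          # modifier category such as (N/N) or (S\S)
        return (1.0 - P_TERM) * slash_p * P_MOD * cat_prior(c.result)
    return (1.0 - P_TERM) * slash_p * (1.0 - P_MOD) * cat_prior(c.result) * cat_prior(c.arg)

# Example: an intransitive-verb category (S\NP) and a noun modifier (N/N).
print(cat_prior(Slash(Atom("S"), Atom("NP"), forward=False)))
print(cat_prior(Slash(Atom("N"), Atom("N"), forward=True)))
```

Under this kind of prior, low p_term favors complex categories over atomic ones, while low p_mod makes modifier categories comparatively rare, which is how the reported values (p_term=0.7, p_mod=0.1, p_fwd=0.5) would bias the grammar.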