Understanding and Exploiting Language Diversity
Authors: Fausto Giunchiglia, Khuyagbaatar Batsuren, Gabor Bella
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The UKC contains 2,802,811 ambiguity instances across its pool of 335 languages. These instances were automatically generated and then given in input to the algorithm which, in turn, generated 908,110 candidate polysemes and 594,115 candidate homonyms across all languages. A sample of 640 cases, half being candidate homonyms and half being candidate polysemes, were randomly selected, which were equally divided across seven languages belonging to six different phyla (English, Hindi, Hungarian, Korean, Kazakh, Chinese, Arabic). Seven native speakers were selected as evaluators. All the evaluators, though not being linguists by training, had previously had some exposure to WordNet. They were provided with the glosses of the concepts involved, they were asked the following question: "Do you think meanings c1 and c2 of word w are related?", and they had to provide a yes/no answer. Table 8 provides statistics and accuracy values for each of the languages evaluated. The average accuracy for finding polysemes is 98.3%... |
| Researcher Affiliation | Academia | Fausto Giunchiglia, Khuyagbaatar Batsuren, Gabor Bella DISI, University of Trento, Italy |
| Pseudocode | Yes | Algorithm 1: Lexical Ambiguity Classification |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology it describes. |
| Open Datasets | No | The paper describes the Universal Knowledge Core (UKC) as the resource used for experiments, stating it is "populated via the import of freely available resources", but does not provide any specific link, citation, or clear statement confirming the UKC itself is publicly available for access. |
| Dataset Splits | Yes | We have learned the parameters (λ, β, TD, TS, TM) using a training set of 173 polysemes and 146 homonyms from three phyla. A sample of 640 cases, half being candidate homonyms and half being candidate polysemes, were randomly selected, which were equally divided across seven languages belonging to six different phyla... |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions existing lexical resources like WordNet and BabelNet as sources for populating the UKC, but does not specify any software dependencies or libraries with version numbers used for the implementation or experiments. |
| Experiment Setup | Yes | The grid has been built by taking, for each parameter, an increment of 0.1 within the following ranges: λ = [1.2, 4.0] (higher values favour more phyla in the language set), TD = [1.0, 10.0] (the higher the value the more diversity is required for polysemy and homonymy detection), TS = [0.3, 1.7] (the lower the value the more similarity is allowed for homonymy), β = [0.0, 1.5] (the lower the less relative significance of geographic diversity), TM = [0.5, 0.8]. |
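
The parameter grid quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code (the paper provides none): the names `RANGES`, `axis`, and `grid` are hypothetical, and it assumes both endpoints of each range are included. It only enumerates the candidate settings; how each setting is scored against the training set is not described here.

```python
from itertools import product
from math import prod

# Ranges as quoted in the Experiment Setup row.
# Assumption: both endpoints are inclusive.
RANGES = {
    "lambda": (1.2, 4.0),
    "T_D": (1.0, 10.0),
    "T_S": (0.3, 1.7),
    "beta": (0.0, 1.5),
    "T_M": (0.5, 0.8),
}
STEP = 0.1  # increment stated in the quoted text


def axis(lo, hi, step=STEP):
    """Return the values lo, lo + step, ..., hi, rounded to one decimal.

    The number of steps is computed with round() so that floating-point
    error does not drop or duplicate the upper endpoint.
    """
    n_steps = round((hi - lo) / step)
    return [round(lo + i * step, 1) for i in range(n_steps + 1)]


# One list of candidate values per parameter.
AXES = {name: axis(lo, hi) for name, (lo, hi) in RANGES.items()}


def grid():
    """Lazily yield every parameter combination as a dict."""
    names = list(AXES)
    for combo in product(*(AXES[name] for name in names)):
        yield dict(zip(names, combo))


# Total number of grid points, computed without materialising the grid.
GRID_SIZE = prod(len(values) for values in AXES.values())
```

Under the inclusive-endpoint assumption, the axes have 29, 91, 15, 16, and 4 values, so the full grid has 29 × 91 × 15 × 16 × 4 = 2,533,440 points. Iterating lazily with `grid()` avoids holding all of them in memory at once.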