Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification
Authors: Mozhi Zhang, Yoshinari Fujinuma, Jordan Boyd-Graber (pp. 9547-9554)
AAAI 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages. |
| Researcher Affiliation | Academia | Mozhi Zhang (CS and UMIACS, University of Maryland, College Park, MD, USA); Yoshinari Fujinuma (Computer Science, University of Colorado, Boulder, CO, USA); Jordan Boyd-Graber (CS, iSchool, LSC, and UMIACS, University of Maryland, College Park, MD, USA); footnote: Now at Google Research Zurich |
| Pseudocode | No | The paper describes the model architecture and training process in text and with a diagram (Figure 1), but it does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide any explicit statements about making its source code available or links to a code repository. |
| Open Datasets | Yes | Our first dataset is Reuters multilingual corpus (RCV2), a collection of news stories labeled with four topics (Lewis et al. 2004)... We build a second CLDC dataset with famine-related documents sampled from Tigrinya (TI) and Amharic (AM) LORELEI language packs (Strassel and Tracey 2016). |
| Dataset Splits | No | For each language, we sample 1,500 training documents and 200 test documents with balanced labels. No explicit mention of a separate validation set. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions "Adam (Kingma and Ba 2015) with default settings" as the optimizer, but does not specify versions for other software dependencies or libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We use three ReLU layers with 100 hidden units and 0.1 dropout for the CLWE-based DAN models and the DAN classifier of the CACO models. The BI-LSTM embedder uses ten dimensional character embeddings and forty hidden states with no dropout. The outputs of the embedder are forty dimensional word embeddings. We set λd to 1, λe to 0.001, and λp to 1 in the multi-task objective (Equation 11). ... All models are trained with Adam (Kingma and Ba 2015) with default settings. We run the optimizer for a hundred epochs with mini-batches of sixteen documents. |
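The setup row quotes the multi-task objective (Equation 11) with weights λd = 1, λe = 0.001, and λp = 1. As a minimal sketch of how such a weighted objective combines, the snippet below sums three placeholder loss terms with those weights; the loss names and values are hypothetical, since the excerpt does not spell out what each term measures.

```python
# Hedged sketch of a weighted multi-task objective in the style of
# Equation 11, using the lambda values quoted in the table above.
# The individual loss terms (loss_d, loss_e, loss_p) are placeholders.

LAMBDA_D = 1.0    # weight quoted for the first loss term (λd)
LAMBDA_E = 0.001  # weight quoted for the second loss term (λe)
LAMBDA_P = 1.0    # weight quoted for the third loss term (λp)

def multi_task_loss(loss_d: float, loss_e: float, loss_p: float) -> float:
    """Weighted sum of three task losses, as in a multi-task objective."""
    return LAMBDA_D * loss_d + LAMBDA_E * loss_e + LAMBDA_P * loss_p

# Example with made-up loss values:
total = multi_task_loss(0.7, 2.0, 0.3)  # 0.7 + 0.002 + 0.3 = 1.002
```

The small λe (0.001) means the second term contributes only weakly to the combined gradient relative to the other two, which matches the quoted settings.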