Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Global Mixup: Eliminating Ambiguity with Clustering
Authors: Xiangjin Xie, Li Yangning, Wang Chen, Kai Ouyang, Zuotong Xie, Hai-Tao Zheng
AAAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments for CNN, LSTM, and BERT on five tasks show that Global Mixup outperforms previous baselines. Further experiments also demonstrate the advantage of Global Mixup in low-resource scenarios. |
| Researcher Affiliation | Collaboration | 1Shenzhen International Graduate School, Tsinghua University 2Google Inc. 3Pengcheng Laboratory |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | We conduct experiments on five benchmark text classification tasks and Table 1 summarizes the statistical characteristics of the five datasets: 1. YELP: (Yelp 2015), which is a subset of Yelp's businesses, reviews, and user data. 2. SUBJ: (Pang and Lee 2004), which aims to classify the sentences as subjectivity or objectivity. 3. TREC: (Li and Roth 2002), is a question dataset with the aim of categorizing a question into six question types. 4. SST-1: (Socher et al. 2013), is Stanford Sentiment Treebank, five categories of very positive, positive, neutral, negative, and very negative. Data comes from movie reviews and emotional annotations. 5. SST-2: (Socher et al. 2013), is the same as SST-1 but with neutral reviews removed and binary labels. Data comes from movie reviews and emotional annotations. |
| Dataset Splits | Yes | Data Split: We randomly select a subset of training data with N = {500, 2000, 5000} to investigate the performance in few-sample scenario of Global Mixup. Table 1: Summary for the datasets c: number of target labels. N: number of samples. V: valid set size. T: test set size. W means no standard valid split was provided. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory. |
| Software Dependencies | Yes | All models are implemented with Pytorch (Paszke et al. 2019) and Python 3.7. |
| Experiment Setup | Yes | For the λ ∼ Beta(α, α) parameters, we tune the α from {0.5, 1, 2, 4, 8}. And to demonstrate the effectiveness of Global Mixup on a larger space, we extend λ ∈ [−0.3, 1.3] with uniform distribution. We set the number of samples generated per training sample pair T from {2, 4, 8, 16, 20, 32, 64} and the best performance is obtained when T = 8 is selected. The batch size is chosen from {32, 50, 64, 128, 256, 500} and the learning rate from {1e-3, 1e-4, 4e-4, 2e-5}. For the hyperparameter setting, we set θ from {1/c, 0.5, 0.6, 0.8, 0.9, 1}, c is the number of target labels. γ from {1, 2, 4, 6}, τ and η from {1/T, 1}, ϵ = 1e-5, δ = 1. For the reinforced selector, we use Adam optimizer (Kingma and Ba 2015) for CNN and LSTM, AdamW (Loshchilov and Hutter 2017) for BERT. |
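
For readers who want to reproduce the setup quoted in the table, the following is a minimal sketch of the reported search space, the Beta-distributed mixing coefficient, and the few-sample subsets. It is not the authors' code, which the table notes is unavailable. All names (`SEARCH_SPACE`, `iter_configs`, `sample_lambda`, `mix`, `random_subset`) are hypothetical, enumerating the candidates as an exhaustive grid is an assumption, and `mix` is the standard mixup linear interpolation rather than the clustering-based relabeling that Global Mixup adds.

```python
import itertools
import random

# Candidate values quoted in the Experiment Setup row of the table.
SEARCH_SPACE = {
    "alpha": [0.5, 1, 2, 4, 8],
    "samples_per_pair": [2, 4, 8, 16, 20, 32, 64],  # T in the paper's notation
    "batch_size": [32, 50, 64, 128, 256, 500],
    "learning_rate": [1e-3, 1e-4, 4e-4, 2e-5],
}

# Training-subset sizes quoted in the Dataset Splits row.
SUBSET_SIZES = [500, 2000, 5000]


def iter_configs(space):
    """Yield every combination of candidates (assumes an exhaustive grid)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def sample_lambda(alpha, rng):
    """Draw a mixing coefficient lambda ~ Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mix(x_i, x_j, lam):
    """Standard mixup interpolation: lam * x_i + (1 - lam) * x_j."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]


def random_subset(examples, n, rng):
    """Randomly select at most n training examples without replacement."""
    return rng.sample(examples, min(n, len(examples)))
```

A seeded `random.Random` instance can be passed as `rng` so that the sampled coefficients and subsets are repeatable across runs.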