Leveraging Common Structure to Improve Prediction across Related Datasets
Authors: Matt Barnes, Nick Gisolfi, Madalina Fiterau, Artur Dubrawski
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental Results: The artificial datasets in Fig. 1 illustrate how spurious samples degrade the placement of a linear SVM decision boundary in a binary classification task. We consider an oracle model trained only on samples from the common distribution (no spurious points), and a baseline model obtained by training a linear SVM on all of the data, spurious samples included. The spurious samples shift the baseline's decision boundary slightly, so the baseline divides the classes in a way that misclassifies some samples from the default distribution, decreasing accuracy relative to the oracle. To illustrate the effect of greedy spurious-sample removal, we trained a model after each removal iteration, then bootstrapped the entire process to obtain an average accuracy and a 95% confidence interval, shown in Fig. 1. As more samples were removed, the clipped model's performance approached the oracle's, with tighter confidence intervals, confirming that removing spurious samples is beneficial (a hedged sketch of this removal-and-bootstrap loop appears after the table). Next, consider a nuclear threat detection system built to determine whether a vehicle passing through customs emits signatures consistent with radioactive material. Figure 2 depicts the most informative 2D projection, where a non-trivial density mismatch manifests between datasets generated with different simulation parameters: threats are shown in red, normal samples in green, and the removed spurious samples are circled in blue. The baseline model (M0) is trained on all data. Our approach produces a clipped version of DS1, which we add to DS2 to obtain the alternative model M1. We test M0 and M1 on all other datasets. Additionally, we enhance our approach with a gating function: the model used for classification is whichever of M0 and M1 has the smallest Renyi divergence to the test set. We refer to this gated model as M2 (a sketch of the divergence-based gate also follows the table). The justification is that some test datasets may contain spurious samples close enough to those in the original datasets that keeping them is beneficial. As shown in Table 1, the gated version outperforms the other two because it benefits from sample removal when the incoming datasets do not have spurious samples. |
| Researcher Affiliation | Academia | Matt Barnes (mbarnes1@cs.cmu.edu), Nick Gisolfi (ngisolfi@cmu.edu), Madalina Fiterau (mfiterau@cs.cmu.edu), and Artur Dubrawski (awd@cs.cmu.edu), all at Carnegie Mellon University, Pittsburgh, PA 15213 |
| Pseudocode | No | The paper describes its procedure in text but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code (e.g., a specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | No | The paper mentions 'artificial data sets' and 'nuclear threat datasets DS1 and DS2' but does not provide concrete access information (e.g., specific link, DOI, repository name, or formal citation) for any publicly available or open dataset. |
| Dataset Splits | No | The paper mentions 'training sets' and 'test set' but does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using a 'linear SVM' but does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | No | The paper does not contain specific experimental setup details such as concrete hyperparameter values, training configurations, or system-level settings in the main text. |
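The greedy removal-and-bootstrap loop referenced in the Research Type row is described only in prose; the paper releases no code. Below is a minimal Python sketch using scikit-learn. The selection rule (drop the training point whose removal most improves held-out accuracy) is a hypothetical stand-in, since the paper's exact spurious-sample criterion is not quoted here; `greedy_clip`, `bootstrap_accuracy`, and all parameter values are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.svm import LinearSVC

def greedy_clip(X, y, X_val, y_val, n_remove):
    """Greedily drop training samples, one per iteration.

    The rule used here (remove the sample whose exclusion most improves
    validation accuracy of a linear SVM) is a hypothetical stand-in for
    the paper's spurious-sample criterion.
    """
    keep = np.ones(len(X), dtype=bool)
    for _ in range(n_remove):
        best_acc, best_i = -np.inf, None
        for i in np.flatnonzero(keep):
            keep[i] = False  # tentatively remove sample i
            acc = LinearSVC().fit(X[keep], y[keep]).score(X_val, y_val)
            keep[i] = True
            if acc > best_acc:
                best_acc, best_i = acc, i
        keep[best_i] = False  # commit the single best removal
    return keep

def bootstrap_accuracy(X, y, X_test, y_test, n_boot=200, seed=0):
    """Bootstrap the train/evaluate cycle: mean accuracy and a 95% CI."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        while len(np.unique(y[idx])) < 2:  # re-draw if a class vanished
            idx = rng.integers(0, len(X), size=len(X))
        accs.append(LinearSVC().fit(X[idx], y[idx]).score(X_test, y_test))
    accs = np.sort(np.asarray(accs))
    lo, hi = accs[int(0.025 * n_boot)], accs[int(0.975 * n_boot)]
    return accs.mean(), (lo, hi)
```

Calling `bootstrap_accuracy` on the clipped training set after each `greedy_clip` iteration would reproduce the trend reported for Fig. 1: clipped-model accuracy approaching the oracle's, with tightening confidence intervals.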
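The gated model M2 routes each incoming test set to whichever of M0 and M1 has the smallest Renyi divergence to it. The excerpt does not say how the divergence is estimated, so the sketch below uses a kernel-density plug-in estimate of the Renyi-alpha divergence, one common choice; `alpha`, `bandwidth`, the divergence direction, and the function names are all assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.neighbors import KernelDensity

def renyi_divergence(X_p, X_q, alpha=0.5, bandwidth=0.5):
    """Plug-in estimate of the Renyi divergence D_alpha(p || q).

    Densities are approximated with Gaussian KDEs, and
    D_alpha = 1/(alpha-1) * log E_{x~p}[(p(x)/q(x))^(alpha-1)]
    is estimated by averaging over the samples of p. The alpha,
    bandwidth, and direction (test vs. train) are illustrative
    choices, not taken from the paper.
    """
    log_p = KernelDensity(bandwidth=bandwidth).fit(X_p).score_samples(X_p)
    log_q = KernelDensity(bandwidth=bandwidth).fit(X_q).score_samples(X_p)
    r = (alpha - 1.0) * (log_p - log_q)
    # logsumexp keeps the sample-average expectation numerically stable
    return (logsumexp(r) - np.log(len(r))) / (alpha - 1.0)

def gated_predict(models, train_sets, X_test, alpha=0.5):
    """Gate in the spirit of M2: route the test set to whichever
    candidate model's training data is closest in Renyi divergence."""
    divs = [renyi_divergence(X_test, X_tr, alpha=alpha) for X_tr in train_sets]
    best = models[int(np.argmin(divs))]
    return best.predict(X_test)
```

In the paper's setup, `models` would hold the fitted M0 and M1 and `train_sets` their respective training pools (all data vs. clipped DS1 plus DS2), so a test set containing spurious samples near those of the original data falls back to M0, matching the justification given for the gate.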