Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Detecting Click Fraud in Online Advertising: A Data Mining Approach

Authors: Richard Oentaryo, Ee-Peng Lim, Michael Finegold, David Lo, Feida Zhu, Clifton Phua, Eng-Yeow Cheu, Ghim-Eng Yap, Kelvin Sim, Minh Nhut Nguyen, Kasun Perera, Bijay Neupane, Mustafa Faisal, Zeyar Aung, Wei Lee Woon, Wei Chen, Dhaval Patel, Daniel Berrar

JMLR 2014 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We organized a Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, opening the opportunity for participants to work on real-world fraud data from Buzz City Pte. Ltd., a global mobile advertising company based in Singapore. In particular, the task is to identify fraudulent publishers who generate illegitimate clicks, and distinguish them from normal publishers. The competition results provide a comprehensive study on the usability of data mining-based fraud detection approaches in a practical setting. Our principal findings are that features derived from fine-grained time-series analysis are crucial for accurate fraud detection, and that ensemble methods offer promising solutions to highly imbalanced nonlinear classification tasks with mixed variable types and noisy/missing patterns.
Researcher Affiliation | Collaboration | Living Analytics Research Centre, Singapore Management University; SAS Institute Pte. Ltd.; Institute for Infocomm Research; Masdar Institute of Science and Technology; Indian Institute of Technology Roorkee; Tokyo Institute of Technology
Pseudocode | Yes | Algorithm 1: Backward feature elimination. Require: feature set F = {f_i} and their ranks r_i, minimum number of features m, training data T, and validation data V
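Only the algorithm's inputs are excerpted above. A minimal Python sketch of backward feature elimination consistent with those inputs might look as follows; the scoring model (a random forest), the validation metric (average precision), and the stop-when-score-drops rule are assumptions not specified in the excerpt:

```python
# Hedged sketch of backward feature elimination (Algorithm 1 in the paper).
# Scorer, metric, and stopping rule are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def backward_feature_elimination(X_tr, y_tr, X_val, y_val, ranks, m):
    """Repeatedly drop the worst-ranked feature while the validation
    score does not degrade, keeping at least m features."""
    # Assumed convention: higher rank value = more important, so the
    # feature at the front of `selected` is the current worst-ranked one.
    selected = sorted(range(X_tr.shape[1]), key=lambda i: ranks[i])

    def score(feats):
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X_tr[:, feats], y_tr)
        proba = clf.predict_proba(X_val[:, feats])[:, 1]
        return average_precision_score(y_val, proba)

    best = score(selected)
    while len(selected) > m:
        candidate = selected[1:]      # tentatively drop the worst feature
        s = score(candidate)
        if s < best:                  # removal hurt validation score: stop
            break
        best, selected = s, candidate
    return selected, best

# Synthetic stand-in for the train (T) and validation (V) sets.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_tr, y_tr, X_val, y_val = X[:300], y[:300], X[300:], y[300:]
ranks = list(range(10))               # placeholder ranks for illustration
feats, ap = backward_feature_elimination(X_tr, y_tr, X_val, y_val, ranks, m=3)
print(len(feats), round(ap, 3))
```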
Open Source Code | No | The text does not provide a specific link to source code, nor does it contain an explicit statement that the code for the described methodology is being released or is available in supplementary materials. Mentions of URLs are for feature lists or competition data. For example: "A complete listing of the 118 features is available at http://clifton.phua.googlepages.com/feature-list.txt." and "The competition data remain available for further studies at http://palanteer.sis.smu.edu.sg/fdma2012/."
Open Datasets | Yes | We organized a Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, opening the opportunity for participants to work on real-world fraud data from Buzz City Pte. Ltd., a global mobile advertising company based in Singapore. The competition data remain available for further studies at http://palanteer.sis.smu.edu.sg/fdma2012/.
Dataset Splits | Yes | To this end, Buzz City provides three sets of publishers and clicks data taken from different time periods: a training set (for building predictive models), a validation set (for model selection), and a test set (for evaluating the models' generalization abilities and determining the competition winners). Each click data set captures the click traffic over a 3-day period, while each publisher data set records publishers receiving at least one click in that period. We summarize the count statistics of the publishers and clicks in Table 6. Time periods: Train, 9-11 Feb 2012; Validation, 23-25 Feb 2012; Test, 8-10 Mar 2012.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It mentions software packages like R's gbm but no hardware.
Software Dependencies | No | The paper mentions software packages like "R's gbm package (Ridgeway, 2007)", "R's randomForest package", "WEKA (Hall et al., 2009)", the "LIBLINEAR framework (Fan et al., 2008)", and the "LIBSVM framework (Chang and Lin, 2011)". However, it does not specify the version numbers for any of these software components or the R environment itself.
Experiment Setup | Yes | The final parameters used on the final training data set for our best average precision on the test data set are: distribution (loss function): bernoulli (also tested the AdaBoost distribution); n.trees (number of iterations): 5000 (tested 100 to 5000 decision trees); shrinkage (learning rate): 0.001 (tested 0.001 to 0.01); interaction.depth (tree depth): 5 (tested 2 to 5); n.minobsinnode (minimum observations in terminal node): 5 (tested 2 to 5).
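The reported configuration is for R's gbm package. As a rough illustration only, an approximate scikit-learn analogue is sketched below; this mapping is an assumption, not the authors' actual setup, and the parameter semantics are not identical across the two libraries:

```python
# Approximate scikit-learn analogue of the reported R gbm configuration.
# The paper's final n.trees is 5000; it is reduced here purely for speed.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    # default log-loss objective ~ gbm's distribution = "bernoulli"
    n_estimators=200,      # gbm n.trees (paper: 5000; tested 100 to 5000)
    learning_rate=0.001,   # gbm shrinkage (tested 0.001 to 0.01)
    max_depth=5,           # gbm interaction.depth (tested 2 to 5)
    min_samples_leaf=5,    # gbm n.minobsinnode (tested 2 to 5)
)

# Synthetic stand-in data; the actual study used the FDMA 2012 sets.
X, y = make_classification(n_samples=300, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Note that with such a small learning rate the model needs many boosting iterations to fit well, which is consistent with the paper's choice of 5000 trees.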