CONDUCTING RIGOROUS A/B TESTING FOR ML MODEL VARIANT ASSESSMENT
Keywords:
Models, Data Preparation, Cross-Validation, Metrics, Guardrails, Machine Learning Models, A/B Testing
Abstract
This article presents best practices for conducting rigorous A/B testing to compare machine learning (ML) model variants. It explains why randomization and stratified sampling are essential for constructing fair, representative test and control groups, and how to select appropriate evaluation metrics and define guardrails that keep model performance within acceptable bounds. For unbiased model assessment, it highlights the role of holdout datasets and cross-validation. The article also addresses data privacy, collaboration across cross-functional teams, documentation, and post-deployment monitoring of ML models, supporting its recommendations with industry case studies and published research.
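As a minimal illustration of the kind of variant comparison the abstract describes, the sketch below runs a two-sided two-proportion z-test on success counts from two model variants. All counts and names are invented for illustration; this is one simple frequentist approach among the many experiment-analysis methods surveyed in the references, not a method prescribed by the article.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns the z statistic and an approximate two-sided p-value
    computed from the standard normal distribution.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF (math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical results: variant A (control model) vs. variant B (candidate),
# e.g. counts of correct predictions or conversions per exposed user.
z, p = two_proportion_z_test(success_a=1020, n_a=10000,
                             success_b=1105, n_b=10000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

In practice, sample sizes and significance thresholds would be fixed before the experiment, and guardrail metrics would be checked alongside the primary metric, as the article recommends.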
References
A. Deng and X. Shi, "Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2016, pp. 77–86.
C. A. Gomez-Uribe and N. Hunt, "The Netflix Recommender System: Algorithms, Business Value, and Innovation," ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, pp. 1–19, Dec. 2015.
S. Tadelis and S. Thomadsen, "Optimal Search Engine Design," Manage. Sci., vol. 65, no. 12, pp. 5619–5638, Dec. 2019.
Kaggle, "State of Data Science and Machine Learning 2020," Kaggle, Inc., San Francisco, CA, USA, 2020.
R. Kohavi, D. Tang, and Y. Xu, "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing," Cambridge University Press, 2020.
T. Calders and S. Verwer, "Three naive Bayes approaches for discrimination-free classification," Data Min. Knowl. Discov., vol. 21, no. 2, pp. 277–292, Sep. 2010.
J. Buolamwini and T. Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Proc. 1st Conf. Fairness Account. Transpar., 2018, pp. 77–91.
S. Yadav and S. Shukla, "Analysis of k-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification," in Proc. IEEE 6th Int. Conf. Adv. Comput., 2016, pp. 78–83.
M. Chen et al., "Automating Econometric Modeling for Uber's Marketplace," in Proc. 27th ACM SIGKDD Conf. Knowl. Discov. Data Min., 2021, pp. 3492–3500.
R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, "Controlled experiments on the web: survey and practical guide," Data Min. Knowl. Discov., vol. 18, no. 1, pp. 140–181, Feb. 2009.
W. Xu, A. Deng, and D. Liang, "Variance Reduction in Online Experiments," in Proc. Web Conf. 2021, 2021, pp. 3529–3539.
J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Comput. Surv., vol. 46, no. 4, pp. 1–37, Mar. 2014.
Gartner, "Gartner Identifies Three Key Factors That Will Impact AI Adoption Through 2024," Gartner, Inc., Stamford, CT, USA, 2021.
D. Agarwal, B. Long, J. Traupman, D. Xin, and L. Zhang, "LASER: A Scalable Response Prediction Platform for Online Advertising," in Proc. 7th ACM Int. Conf. Web Search Data Min., 2014, pp. 173–182.
T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, Jun. 2006, doi: 10.1016/j.patrec.2005.10.010.
Y. Yue, R. Patel, and H. Roehrig, "Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data," in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 1011–1018, doi: 10.1145/1772690.1772793.
R. Kohavi and R. Longbotham, "Online controlled experiments and A/B testing," in Encyclopedia of Machine Learning and Data Mining, 2nd ed., C. Sammut and G. I. Webb, Eds. New York, NY, USA: Springer, 2017, pp. 922–929, doi: 10.1007/978-1-4899-7687-1_891.
D. Siroker and P. Koomen, A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. Hoboken, NJ, USA: John Wiley & Sons, 2013.
D. Tang, A. Agarwal, D. O'Brien, and M. Meyer, "Overlapping experiment infrastructure: More, better, faster experimentation," in Proc. 16th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2010, pp. 17–26, doi: 10.1145/1835804.1835810.
X. Amatriain and J. Basilico, "Recommender systems in industry: A Netflix case study," in Recommender Systems Handbook, 2nd ed., F. Ricci, L. Rokach, and B. Shapira, Eds. Boston, MA, USA: Springer, 2015, pp. 385–419, doi: 10.1007/978-1-4899-7637-6_11.
E. Bakshy and D. Eckles, "Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods," in Proc. 19th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2013, pp. –1311, doi: 10.1145/2487575.2488218.
A. Deng, J. Lu, and S. Chen, "Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing," in Proc. IEEE Int. Conf. Data Science and Advanced Analytics (DSAA), 2016, pp. 243–, doi: 10.1109/DSAA.2016.33.
A. Deng, P. Zhang, S. Chen, D. Kim, and J. Lu, "Concise summarization of heterogeneous treatment effect using total variation regularized regression," arXiv preprint arXiv:1908.08482, 2019.
J. Stucchio, "Bayesian A/B testing at VWO," Visual Website Optimizer, 2015. [Online]. Available: https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf
B. Letham, B. Karrer, G. Ottoni, and E. Bakshy, "Constrained Bayesian optimization with noisy experiments," Bayesian Analysis, vol. 14, no. 2, pp. 495–519, Jun. 2019, doi: 10.1214/18-BA1110.
J. Komiyama, J. Honda, and H. Nakagawa, "Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays," in Proc. 32nd Int. Conf. Machine Learning, 2015, pp. –1161.
A. Agarwal et al., "Making contextual decisions with low technical debt," arXiv preprint arXiv:1606.03966, 2016.
I. Guyon, "A Scaling Law for the Validation-Set Training-Set Size Ratio," AT&T Bell Lab., 1997.
R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 1137–1143.
S. Agarwal, H. Daumé III, and S. Gerber, "Learning Multiple Tasks using Manifold Regularization," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2010, pp. 46–54.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer, 2009.
A. Kejariwal and C. Ré, "Machine Learning at Dropbox," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2020, pp. 3485–3486.
J. Blitzer, M. Dredze, and F. Pereira, "Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification," in Proc. 45th Annu. Meet. Assoc. Comput. Linguist., 2007, pp. 440–.
S. Arlot and A. Celisse, "A survey of cross-validation procedures for model selection," Stat. Surv., vol. 4, pp. 40–79, 2010.
R. B. Rao, G. Fung, and R. Rosales, "On the Dangers of Cross-Validation. An Experimental Evaluation," in Proc. 2008 SIAM Int. Conf. Data Min., 2008, pp. 588–596.
M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli, "Is Deep Learning Safe for Robot Vision? Adversarial Examples against the iCub Humanoid," in Proc. IEEE Int. Conf. Comput. Vis. Workshop, 2017, pp. 751–759.
Ponemon Institute, "Cost of a Data Breach Report 2020," Ponemon Institute, Traverse City, MI, USA, 2020.
S. Peukert, J. Bechtold, M. Batikas, and B. Michalk, "European General Data Protection Regulation: An Empirical Analysis of Its Effects," Eur. J. Inf. Syst., vol. 29, no. 6, pp. 610–628, Nov. 2020.
S. Amershi et al., "Software Engineering for Machine Learning: A Case Study," in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng. Softw. Eng. Pract., 2019, pp. 291–300.
Salesforce, "The Impact of Collaboration on Business Performance," Salesforce Research, San Francisco, CA, USA, 2015.
M. Lin, J. Liu, T. Pan, and E. Botelho, "Building a Large-Scale Experimentation Platform for Online Services at Uber," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2020, pp. 3487–3488.
D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2503–2511.
A. Deng, Y. Xu, R. Kohavi, and T. Walker, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data," in Proc. 6th ACM Int. Conf. Web Search Data Min., 2013, pp. 123–132.
M. Hardt, E. Price, and N. Srebro, "Equality of Opportunity in Supervised Learning," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 3323–3331.
License
Copyright (c) 2024 Senthilbharanidhar BoganaVijaykumar (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.