LEVERAGING USER BEHAVIOR-BASED SERVICE LEVEL INDICATORS AND OBJECTIVES TO PREVENT AND MITIGATE SOFTWARE INCIDENTS

Authors

Karan Khanna, San Jose State University, USA

Keywords:

User Behavior-Based SLIs/SLOs, Incident Prevention, Mitigation, Software Reliability Engineering, AI-Assisted Incident Management, DevOps, Site Reliability

Abstract

Incidents, defined as unplanned disruptions to software services, can lead to degraded quality and significant losses for organizations. Categorizing incidents by trigger, type, actor, and impact yields insights that can drive improvements. Service Level Indicators (SLIs) measure service health and reliability, and Service Level Objectives (SLOs) set the targets those indicators are expected to meet. However, the current approach of defining SLIs/SLOs in isolation from actual user behavior patterns limits their effectiveness in predicting and preventing user-impacting incidents. This paper proposes a novel approach to defining User Behavior-Based SLIs and SLOs that are directly mapped to key user journeys and features. By cascading these user-centric SLOs down to the underlying microservices and components, more relevant and proactive reliability targets can be set. Potential benefits include fewer incidents, shorter Mean Time to Resolution (MTTR), better user experiences, and reduced operational risk. Integrating this approach with machine learning techniques also shows promise for AI-assisted incident forecasting and prevention.
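
The idea of measuring an SLI at the level of a user journey and cascading its SLO down to components can be illustrated with a minimal Python sketch. It is not the paper's method: the record shape `JourneyAttempt`, the functions `journey_sli` and `cascade_slo`, the service names, and the thresholds are all hypothetical, and the cascade assumes that every service sits on the journey's critical path, that failures are independent, and that the error budget is shared equally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyAttempt:
    """One user attempt at a journey (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float


def journey_sli(attempts, latency_threshold_ms):
    """Fraction of attempts that succeeded within the latency threshold."""
    if not attempts:
        raise ValueError("no attempts recorded")
    good = sum(
        1 for a in attempts
        if a.succeeded and a.latency_ms <= latency_threshold_ms
    )
    return good / len(attempts)


def cascade_slo(journey_target, services):
    """Split a journey-level target evenly across serially dependent services.

    Assumption: all services are on the critical path and fail independently,
    so the per-service targets must multiply back to the journey target.
    """
    if not 0.0 < journey_target <= 1.0:
        raise ValueError("journey_target must be in (0, 1]")
    if not services:
        raise ValueError("at least one service is required")
    per_service = journey_target ** (1.0 / len(services))
    return {name: per_service for name in services}


# Illustrative data for a hypothetical "checkout" journey.
attempts = [
    JourneyAttempt(True, 420.0),
    JourneyAttempt(True, 1800.0),   # succeeded, but too slow to count as good
    JourneyAttempt(False, 300.0),
    JourneyAttempt(True, 650.0),
]
sli = journey_sli(attempts, latency_threshold_ms=1000.0)
targets = cascade_slo(0.99, ["cart", "payment", "inventory"])

print(f"journey SLI: {sli:.2f}")
for name, target in targets.items():
    print(f"{name}: target {target:.5f}")
```

Under these assumptions each component target comes out stricter than the journey target, which is the reason for deriving component objectives from the journey rather than setting them in isolation. Real dependency graphs with retries, fallbacks, or parallel calls would need a different allocation.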
