{"id":530,"date":"2022-01-13T15:16:26","date_gmt":"2022-01-13T15:16:26","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/01\/13\/5-design-patterns-for-building-observable-services\/"},"modified":"2022-01-13T15:16:26","modified_gmt":"2022-01-13T15:16:26","slug":"5-design-patterns-for-building-observable-services","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/01\/13\/5-design-patterns-for-building-observable-services\/","title":{"rendered":"5 Design Patterns for Building Observable Services"},"content":{"rendered":"<p>How can you make your services observable and embrace service ownership? This article presents a variety of universally applicable design patterns for the developer to consider.<\/p>\n<p>Design patterns in software development are repeatable solutions and best practices for solving commonly occurring problems. Even in the case of service monitoring, design patterns, when used appropriately, can help teams embrace service ownership and troubleshoot their services in production. You can think of service monitoring design patterns in three categories:<\/p>\n<p><strong>Health Checks<br \/><\/strong>How do you know that your service is running\u200a\u2014\u200aand, if it is\u200a\u2014\u200aalso doing what it\u2019s supposed to be doing? Is it responding in a timely manner? Are there potential service issues that you can address before they affect customers?<strong>Real time Alerting<br \/><\/strong>When something does go wrong\u200a\u2014\u200asuch as your service becoming unresponsive, slowing to a crawl, or using too many resources\u200a\u2014\u200ado you have alerts that you configured to notify you of that\u00a0issue?<strong>Troubleshooting<br \/><\/strong>Has something gone wrong with your service? If it has, you probably need to know three things: when it happened, where it happened, and what caused it to happen. 
Use logs and traces to diagnose issues after they occur, and update your service so that those issues do not have lasting customer\u00a0impacts.<\/p>\n<p>Let\u2019s look at the patterns (and a few anti-patterns!) for each of these categories in\u00a0turn.<\/p>\n<h3>Health Checks<\/h3>\n<p>The two patterns for health checks are the <strong>outside-in health check<\/strong>, which verifies that your service is running and measures response time \/ latency from your service, and the <strong>inside-out health check<\/strong>, which keeps tabs on app and system metrics so that you can detect potential problems (including performance problems) before they cause an incident.<\/p>\n<h3>1. The Outside-In Health\u00a0Check<\/h3>\n<p>In this pattern, you ping your service endpoint using a health-check service or a synthetic testing tool. We use a tool we built in-house, but other options include NewRelic, Gomez, and DataDog. Your service responds to the ping, and the tool records the result of the check as a metric in a time-series metric system such as <a href=\"https:\/\/github.com\/salesforce\/Argus\">Argus<\/a> (or whichever system you use for this purpose). 
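A minimal sketch of this outside-in check, assuming Python; the endpoint URL, metric names, and the in-memory `sink` are illustrative stand-ins for a real synthetic-testing tool and a time-series system such as Argus:

```python
import time
import urllib.error
import urllib.request

def outside_in_check(url, timeout=5.0):
    """Ping a service endpoint; return (is_up, latency_seconds).

    A real synthetic-testing tool would run this on a schedule from
    multiple geographic locations. Any non-2xx response or network
    error counts as "down".
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            is_up = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):  # HTTP errors, refusals, timeouts
        is_up = False
    return is_up, time.monotonic() - start

def record_datapoint(sink, metric, value):
    """Record a timestamped datapoint; a real implementation would push
    to a time-series metric system instead of a Python list."""
    sink.append({"metric": metric, "value": value, "ts": time.time()})

sink = []
up, latency = outside_in_check("http://127.0.0.1:9", timeout=0.5)  # illustrative URL
record_datapoint(sink, "healthcheck.up", 1.0 if up else 0.0)
record_datapoint(sink, "healthcheck.latency_s", latency)
```

Running the check from several regions and alerting on the recorded datapoints yields both the uptime and the user-perceived-latency signals.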
Once the data is available, you can visualize your service\u2019s health and other key metrics over a period of time and\/or have it send you alerts when specific conditions are\u00a0met.<\/p>\n<p>The high-level pattern looks like\u00a0this.<\/p>\n<p>Diagram outlining the high-level pattern for outside-in health\u00a0checks.<\/p>\n<p>This pattern can be used to check two main groups of\u00a0metrics:<\/p>\n<p><strong>Uptime<\/strong>\u200a\u2014\u200aThis metric answers the time-honored question, \u201cIs my service running, and is it doing what it is supposed to do?\u201d After the health-check invoker pings your service, if your service responds, it is up; if it doesn\u2019t respond, it\u2019s down, and you need to start remediation\u00a0efforts.<\/p>\n<p><strong>User-perceived latency<\/strong>\u200a\u2014\u200aYou must check service latency from multiple locations around the globe. User-perceived latency might be different for a customer accessing your service from Japan, for example, than it is for customers accessing it from Spain or the United States. You can use the latency of the health-check API call as a proxy for user-perceived latency.<\/p>\n<p>The combination of the two groups above can also be used to derive a synthetic availability signal for your\u00a0service.<\/p>\n<h3>2. The Inside-Out Health\u00a0Check<\/h3>\n<p>In this pattern, you use system and application metrics to detect potential issues before they result in service interruptions\u200a\u2014\u200afor your service and your customers!<\/p>\n<p>To determine whether there are underlying issues that might affect your service\u2019s performance or availability, collect metrics on the application and the underlying infrastructure on which it is\u00a0running.<\/p>\n<p>Diagram outlining the high-level pattern of inside-out health\u00a0checks.<\/p>\n<p>So which application metrics should you be concerned with? 
At a minimum, collect these four\u00a0signals.<\/p>\n<p><strong>Request rate<\/strong>\u200a\u2014\u200aHow busy is your\u00a0service?<\/p>\n<p><strong>Error rate<\/strong>\u200a\u2014\u200aAre there any errors in your service? If there are, how many, and how often do they\u00a0occur?<\/p>\n<p><strong>Duration (of requests)<\/strong>\u200a\u2014\u200aHow long is it taking for your service to respond to requests?<\/p>\n<p><strong>Saturation<\/strong>\u200a\u2014\u200aHow overloaded is your service? How much room does it have to\u00a0grow?<\/p>\n<p>In addition, gather these system metrics for each resource type (e.g. CPU, memory, disk space, IOPS) that your service\u00a0uses.<\/p>\n<p><strong>Utilization<\/strong>\u200a\u2014\u200aHow busy is this resource?<\/p>\n<p><strong>Saturation<\/strong>\u200a\u2014\u200aHow close to its limit is the most constrained resource, such that it impacts the application\u2019s performance or ability to function normally?<\/p>\n<p><strong>Errors<\/strong>\u200a\u2014\u200aWhat is the resource\u2019s count of\u00a0errors?<\/p>\n<p>Collecting these metrics also allows you to compute the availability metric of the service, which answers the <em>critical<\/em> question\u200a\u2014\u200aIs my service or feature considered available to my customers? In most cases, a combination of Uptime, Duration (latency), and Error Rate metrics can be used to compute the availability of the service in a manner that is representative of the customer\u2019s experience with the service. For example, if you compute the availability of the service solely based upon uptime, you will still consider yourself available even if queries take a long time to respond or web pages take a long time to load. Hence, factoring in additional metrics such as Error Rate and Duration is critical for an appropriate definition of Availability.<\/p>\n<h3>An Anti-Pattern: Modeling Metrics Using\u00a0Logs<\/h3>\n<p>Log data is great for troubleshooting, which we\u2019ll talk about soon. 
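The availability computation described above can be sketched as follows. The per-minute granularity and the specific thresholds are illustrative assumptions, not part of the pattern: a minute only counts as available if the service was up, the error rate stayed acceptable, and latency stayed acceptable.

```python
def minute_available(up, error_rate, p95_latency_ms,
                     max_error_rate=0.01, max_latency_ms=500):
    """A minute counts as available only if the service was up, the
    error rate stayed acceptable, AND requests stayed fast. The
    thresholds here are illustrative, not a standard."""
    return up and error_rate <= max_error_rate and p95_latency_ms <= max_latency_ms

def availability(minutes):
    """Fraction of minutes in the window that met all three criteria."""
    if not minutes:
        return 0.0
    good = sum(1 for m in minutes if minute_available(**m))
    return good / len(minutes)

# Uptime alone would report 100% for this window; factoring in latency
# and errors surfaces the two degraded minutes.
window = [
    {"up": True, "error_rate": 0.0, "p95_latency_ms": 120},
    {"up": True, "error_rate": 0.0, "p95_latency_ms": 2400},   # slow pages
    {"up": True, "error_rate": 0.05, "p95_latency_ms": 150},   # erroring
    {"up": True, "error_rate": 0.0, "p95_latency_ms": 130},
]
print(availability(window))  # 0.5
```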
However, given the volume of log data that most applications generate these days, you really don\u2019t want to derive metrics from log data. Modeling metrics using log data is computationally expensive (which means expensive in monetary terms) and increases the time to ingest, process, and react, which in turn increases MTTD (mean time to detect). Neither of these outcomes is desirable. At Salesforce, we have teams that have built out efficient metrics-gathering tools for our developers to\u00a0use.<\/p>\n<h3>Real-Time\u00a0Alerting<\/h3>\n<p>Alerts and notifications are the primary means by which you, as a service owner, get notified when something is wrong with your service. You should configure alerts to fire when key health and performance metrics of your service and infrastructure go above or below a threshold that indicates an issue: for example, when the response time of the service suddenly increases beyond an acceptable level, or when the availability of your service drops below the point at which it breaches the service\u2019s\u00a0SLA.<\/p>\n<p>The goal of alerting is to notify a human or a remediation process to fix the issue and bring the service back to a healthy status so that it can continue to serve its customers. Although auto-remediation of an issue (kicking off an automated workflow upon getting alerted) is desirable, not all issues (especially complex service issues) can be addressed this way. Typical scenarios where auto-remediation can help include restarting the service to fix an issue, increasing or decreasing compute capacity in cloud environments, and terminating an instance that has unauthorized ports open (a security vulnerability).<\/p>\n<p>By using more sophisticated techniques (including machine learning), you can make the experience better for service owners by alerting proactively and detecting anomalies. 
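The threshold-and-notification logic above can be sketched as follows; the `Alert` shape, channel names, and threshold values are all illustrative assumptions, and a real system would call into a pager and a ticketing tool rather than return strings:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    customer_impacting: bool  # could this breach the SLA?

def route(alert):
    """Notify a human or a remediation process based on severity:
    page only for potential customer impact; ticket the rest."""
    if alert.value <= alert.threshold:
        return "log"     # within bounds: just record it
    if alert.customer_impacting:
        return "page"    # critical: wake the on-call engineer
    return "ticket"      # actionable, but it can wait until morning

print(route(Alert("p95_latency_ms", 900.0, 500.0, customer_impacting=True)))  # page
print(route(Alert("disk_used_pct", 82.0, 80.0, customer_impacting=False)))    # ticket
```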
Proactive alerting allows you to take remedial action before the issue impacts service health and customers. For example, it is better to be alerted that, based upon current usage trends, you will run out of disk capacity in a week than to be alerted once the disk is 90% full! Proactive alerting provides more lead time to fix the issue, and the alert need not result in a page at 3 a.m.; it can instead be logged as a ticket for the service owner to look at the following morning. Detecting anomalies reduces alert noise and saves service owners from configuring static alert thresholds on metrics. The system alerts you when it detects an anomaly in the pattern for a given metric. It takes seasonality into account, which reduces false-positive alerts (noise), and it automatically adjusts, over time, both its thresholds and its definition of what constitutes an\u00a0anomaly.<\/p>\n<h3>3. Remediating Issues<\/h3>\n<p>The general pattern for alerting involves determining whether a problem can be auto-remediated, or whether a human needs to take action to avoid breaching your company\u2019s Service Level Agreement (SLA) with your customers. This process flow diagram incorporates both auto-remediation and remediation requiring human intervention.<\/p>\n<p>Flow diagram showing the decision process for determining whether to initiate auto-remediation or invoke human\u00a0action.<\/p>\n<h3>Alerting Anti-Patterns<\/h3>\n<p><strong>Don\u2019t do glass watching. <\/strong>As systems scale and complexity increases, we cannot rely on humans staring at monitors 24\/7, looking for service-health trends and calling someone to fix an issue when a threshold is breached. We need to rely on machines and algorithms to notify us when something goes wrong. This also takes human error out of the equation. Automate as much of the process as possible.<\/p>\n<p><strong>Not all alerts are created equal. 
<\/strong> As such, don\u2019t treat all alerts the same way. Sending every small service issue as an alert to an email or Slack channel results in nothing more than spam and greatly reduces the signal-to-noise ratio. For every service issue that may result in a customer issue or SLA breach (a critical issue), page the on-call engineer (using something like PagerDuty). Everything else should be logged as a ticket (if action is needed) or simply as a log\u00a0entry.<\/p>\n<h3>Troubleshooting<\/h3>\n<p>When something does go wrong with your service, you need information on what went wrong, when it went wrong, and where it went wrong. Code your service so that this kind of information is available to troubleshoot issues. There are two key ways to make that information available.<\/p>\n<p>Logging error conditions and related information<\/p>\n<p>Enabling distributed tracing in your service (especially in microservice environments)<\/p>\n<h3>4. Logging Error Conditions<\/h3>\n<p>Logs are your best friends for finding out what went wrong with your service and when, so be sure to log your service\u2019s error conditions. Include a logging library in your service to capture app logs and send them to a logging service. At Salesforce, we use <a href=\"https:\/\/www.splunk.com\/\">Splunk<\/a>, but other options include DataDog, NewRelic, etc.<\/p>\n<p>When you troubleshoot, you can use both app and infrastructure logs to help you determine what caused an incident and how to reduce the chance of it recurring.<\/p>\n<p>The following diagram captures the high-level architecture of a logging\u00a0system.<\/p>\n<p>Diagram showing the high-level architecture of a logging\u00a0system.<\/p>\n<h3>5. Distributed Tracing for Microservices<\/h3>\n<p>In a microservices architecture, when an incident or performance degradation occurs, you need to know more than just what went wrong. 
You also need to know which of your microservices caused or contributed to the\u00a0issue.<\/p>\n<p><strong>Distributed tracing<\/strong> allows you to get that information by tagging each request with a request ID. When an issue occurs, you can pinpoint where it occurred in the request stream and whether the issue was associated with your service or with a dependent service. At Salesforce, we\u2019ve integrated all of our various applications with our homegrown distributed tracing service, Tracer (built on top of Zipkin). Generated spans are sent to Tracer, and context is propagated to downstream applications using <a href=\"https:\/\/github.com\/openzipkin\/b3-propagation\">B3 headers<\/a>. For instrumentation, we use either the <a href=\"https:\/\/zipkin.io\/\">Zipkin<\/a> or <a href=\"https:\/\/opentelemetry.io\/\">OpenTelemetry<\/a> libraries.<\/p>\n<h3>So now\u00a0what?<\/h3>\n<p>By applying design patterns around health checks, alerting, and troubleshooting, you can build observable services from the ground up. There are many tools and additional resources out there to help you get started. One that we highly recommend is <a href=\"https:\/\/landing.google.com\/sre\/sre-book\/toc\/\">Google\u2019s <em>Site Reliability Engineering (SRE)<\/em> book<\/a>, particularly the \u201cMonitoring\u201d section in the <a href=\"https:\/\/landing.google.com\/sre\/sre-book\/chapters\/introduction\/\"><em>Introduction<\/em><\/a> and <a href=\"https:\/\/landing.google.com\/sre\/sre-book\/chapters\/monitoring-distributed-systems\/\"><em>Chapter 6: Monitoring Distributed Systems<\/em><\/a>. 
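As a closing illustration, the B3-style context propagation used in the tracing pattern can be sketched as below. The header layout follows the B3 single-header convention ({trace_id}-{span_id}-{sampling}), but the function names are our own, and real services would rely on the Zipkin or OpenTelemetry libraries rather than hand-rolling this:

```python
import secrets

def start_trace():
    """Root service: mint a new trace id and a root span id."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def propagate(ctx):
    """Create a child span that keeps the trace id, and encode it as
    the outgoing single 'b3' header for the downstream service."""
    child = {
        "trace_id": ctx["trace_id"],        # shared across the whole request
        "span_id": secrets.token_hex(8),    # unique per hop
        "parent_span_id": ctx["span_id"],
    }
    return child, {"b3": f"{child['trace_id']}-{child['span_id']}-1"}

root = start_trace()
child, headers = propagate(root)
# Because every span shares one trace id, an issue can be pinned to the
# exact service in the request stream where it occurred.
assert child["trace_id"] == root["trace_id"]
assert child["parent_span_id"] == root["span_id"]
```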
Also check out our <a href=\"https:\/\/engineering.salesforce.com\/tagged\/observability\">Observability 101 series<\/a> for even more on monitoring microservices.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/5-design-patterns-for-building-observable-services-d56e7a330419\">5 Design Patterns for Building Observable Services<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>","protected":false},"excerpt":{"rendered":"<p>How can you make your services observable and embrace service ownership? This article presents a variety of universally applicable design patterns for the developer to consider. Design patterns in software development are repeatable solutions and best practices for solving commonly occurring problems. 
Even in the case of service monitoring, design patterns, when used appropriately, can&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/01\/13\/5-design-patterns-for-building-observable-services\/\">Continue reading <span class=\"screen-reader-text\">5 Design Patterns for Building Observable Services<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-530","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":539,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/02\/reads-service-health-metrics\/","url_meta":{"origin":530,"position":0},"title":"READS: Service Health Metrics","date":"February 2, 2022","format":false,"excerpt":"As you scale your company\u2019s software footprint, with Service-Oriented Architecture (SOA) or microservices architecture featuring 100s or 1000s of services or more, how do you keep track of the performance of every service in every region? How will you know whether you are tracking every service? Of course every service\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":589,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/02\/reads-service-health-metrics-2\/","url_meta":{"origin":530,"position":1},"title":"READS: Service Health Metrics","date":"February 2, 2022","format":false,"excerpt":"As you scale your company\u2019s software footprint, with Service-Oriented Architecture (SOA) or microservices architecture featuring 100s or 1000s of services or more, how do you keep track of the performance of every service in every region? How will you know whether you are tracking every service? 
Of course every service\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":591,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform-2\/","url_meta":{"origin":530,"position":2},"title":"Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform","date":"April 5, 2022","format":false,"excerpt":"At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. In our Technology, Marketing, & Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":561,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/05\/transforming-service-reliability-through-an-slos-driven-culture-platform\/","url_meta":{"origin":530,"position":3},"title":"Transforming Service Reliability Through an SLOs-Driven Culture &amp; Platform","date":"April 5, 2022","format":false,"excerpt":"At Salesforce, Trust is our number-one value, and it has its own special meaning to each part of the company. 
In our Technology, Marketing, & Products (TMP) organization, a big part of Trust is providing highly reliable Salesforce experiences to our customers, which can be challenging because of the scale\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":315,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/api-federation-growing-scalable-api-landscapes\/","url_meta":{"origin":530,"position":4},"title":"API Federation: growing scalable API landscapes","date":"August 31, 2021","format":false,"excerpt":"Organizations embrace micro-services and event-driven APIs in their technology platforms to try to achieve the promise of greater agility, increased innovation, and more autonomy for their development teams. However, after the initial success, it is not unusual for organizations to face difficulties when they try to scale their distributed platforms.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":877,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/11\/unlocking-the-power-of-mixed-reality-devices-with-mobileconfig\/","url_meta":{"origin":530,"position":5},"title":"Unlocking the power of mixed reality devices with MobileConfig","date":"June 11, 2024","format":false,"excerpt":"MobileConfig enables developers to centrally manage a mobile app\u2019s configuration parameters in our data centers. Once a parameter value is changed on our central server, billions of app devices automatically fetch and apply the new value without app updates. 
These remotely managed configuration parameters serve various purposes such as A\/B\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/530","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=530"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/530\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=530"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=530"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=530"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}