{"id":479,"date":"2021-09-29T15:46:06","date_gmt":"2021-09-29T15:46:06","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2021\/09\/29\/switch-it-up\/"},"modified":"2021-09-29T15:46:06","modified_gmt":"2021-09-29T15:46:06","slug":"switch-it-up","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/09\/29\/switch-it-up\/","title":{"rendered":"Switch It Up!"},"content":{"rendered":"<p>It is vital that a microservice striving for high availability have at its disposal several choices on how to roll back changes when unforeseen production issues occur. A very interesting article comes to mind from <a href=\"https:\/\/www.oreilly.com\/content\/generic-mitigations\/\">O\u2019Reilly, <strong><em>Generic Mitigations<\/em><\/strong><\/a>, where the theme is to restore service functionality FIRST and FAST\u200a\u2014\u200aand THEN root cause the real fix as a follow-up. The top goal is to minimize user impact by using \u201cpre-canned\u201d rollback mitigations\u200a\u2014\u200aones that were practiced and that the team can apply virtually blindfolded\u200a\u2014\u200aand employ them fast, restoring service as soon as possible. Software code changes make up most new deployments, and so the focus discussed below is on dealing with errant code changes, typically around tech debt improvements.<\/p>\n<h3>The Problem<\/h3>\n<p>When a service rolls out changes, it usually combines a group of pull requests (PRs) into a release bundle, which is then deployed into production. And when trouble starts, precious time is spent investigating performance data and log and error output, as well as reviewing code PRs to determine the issue\u200a\u2014\u200aall the while the users are being affected. Following the Generic Mitigations pattern, good students of these teachings will invoke a ready mitigation (e.g., a code rollback) and restore the service\u200a\u2014\u200aall good, right!? 
Sort of; the user is able to transact, the service is back, but now more engineering resources and time need to be spent sorting out which PR did what to ultimately provide a follow-on fix. Let\u2019s face it: as microservice patterns flourish and become more prevalent, the system becomes more integrated and complex, and therefore it\u2019s harder to roll out one flawless update after\u00a0another.<\/p>\n<h3>A Solution To Reduce The\u00a0Chaos<\/h3>\n<p>In any given release there is usually a code change that fixes tech debt or updates existing service flows\u200a\u2014\u200aand when such a change regresses and breaks, the rollout goes bad, users and the service are affected, and the above problem repeats. One way to help mitigate these types of changes is the idea of a <strong><em>code switch<\/em><\/strong>, where we wrap the existing code and the newly updated code in a simple IF test, one that is controlled by a code switch. This example pseudo-code is offered\u00a0below.<\/p>\n<p>if codeSwitch.isEnabled(metaServiceAuthN)<br \/>   \/\/ execute new code &#8211; create and sign token and send to meta API<br \/>else<br \/>   \/\/ execute existing code &#8211; get token from IDP and send to meta API<\/p>\n<p>Here, one can see that, if the code switch is enabled, the service uses its new AuthN mechanism to call an outside API. If the code switch is disabled, the existing AuthN, one that has been working for months, is executed, as if the new code update and deployment never happened. The choice is made at runtime, providing the opportunity for changing the code switch on the fly, without any code rollbacks, service restarts, or\u00a0updates.<\/p>\n<p>Now, of course, there are numerous ways to do this type of pattern, one where new functionality can be tried out on a limited set of customers\u200a\u2014\u200athink A\/B testing, or Spring Profiles, or Canary patterns, etc. 
Baeldung.com has an excellent write-up on this family of feature switch technologies <a href=\"https:\/\/www.baeldung.com\/spring-feature-flags\">here<\/a>. All of these are excellent and should be pursued, with their pros and cons considered. However, if you have a new service or have limited developer resources, those robust service capabilities are probably still on the drawing board. This is where the simplicity of code switches comes in. For a light lift (on the order of a few sprints), you can build a feature switch pattern that is tailored for cross-service deployment failure mitigation.<\/p>\n<h3>The Build\u00a0Out<\/h3>\n<p>As we built this out, in our environment of AWS, Kubernetes (K8s), Java Spring Data JPA and a single Relational Database Management System (RDBMS) centrally backing all the workloads, the database was the obvious choice to store the code switch such that it could be shared across K8s workloads. As a side note, we also wanted to avoid service restarts to effect the change, so this ruled out an app.properties approach.<\/p>\n<p><strong>code_switch Table<\/strong><br \/>    name     &#8211; varchar(100)<br \/>    enabled  &#8211; boolean<br \/>    info     &#8211; varchar(100)<\/p>\n<p>A simple enum was employed to keep order with naming and storage consistency.<\/p>\n<p>enum CodeSwitchEnum {<br \/>  NONE(&quot;none&quot;, &quot;none&quot;, false),<br \/>  TEST(&quot;test&quot;, &quot;this is a test&quot;, true),<br \/>  META_SERVICE_AUTHN(&quot;meta_service_authn&quot;, &quot;none&quot;, false);<br \/>  String name;<br \/>  String info;<br \/>  boolean initialEnableState;<\/p>\n<p>And finally, a simple method to check isEnabled(), which does a DB lookup of the\u00a0value.<\/p>\n<p>boolean isEnabled(CodeSwitchEnum switchName) {<br \/>        codeSwitch = codeSwitchRepository.findById(switchName);<br \/>        return codeSwitch.getIsEnabled();<\/p>\n<p>Immediately, we recognized the need to leverage the RDBMS cache on these 
values and specifically avoid using the JPA cache. If the JPA cache were part of the isEnabled() call and the database field were updated at the SQL prompt or via an Admin API call, we would need to flush the JPA caches of every workload, which would be challenging. By relying on the RDBMS cache directly, whenever the field changes in the DB, the next read from any workload flows all the way through to the database to get the updated value and refresh the cache. This proved to be the simplest way to allow real-time database updates, forcing the app to re-read the new value. Another iteration of this could have been to work Redis into the isEnabled() flow, which offers some improvement, but is somewhat limited\u200a\u2014\u200athe DB cache is good enough as a\u00a0start!<\/p>\n<p>Putting it all together, here we can see how, depending on the position of the code switch, a different code path will run. In the case of the switch being enabled, path two executes. Conversely, if the switch is disabled, path one executes. On first read, the first workload will need to go all the way to DB disk, but thereafter, every other workload calling isEnabled() will pull the code switch boolean from the RDBMS shared cache. Upon change of the code switch state, the RDBMS will flush the cache, and all workloads will get the newest value on the next read, which makes it a nice real-time switch.<\/p>\n<p><strong>Initialization\u00a0\u2026 Bring out the Provisioner!!<\/strong><\/p>\n<p>We next considered what should happen the first time a new code switch is added: is it disabled or enabled by default? How are the code switch and its initial state added to the database\u200a\u2014\u200avia a database migration or during the service startup sequence? If we chose the DB migration path, then, for each code switch, we would need to create a new YAML description and run a DB migration after we added the enum to the Java file. 
Adding the field initialEnableState to the enum was the most straightforward way, as it made it simpler for the developer to create the code switch in one\u00a0place.<\/p>\n<p>enum CodeSwitchEnum {<br \/>  boolean initialEnableState;<\/p>\n<p>Next, we chose to provision the code switch at startup, using a simple @PostConstruct annotation in our Admin workload. Here the code switch provisioner will loop through all the enum values, doing a DB lookup for existence. If provisioned previously, the switch is already in the DB, so NEXT! If not, then it will be added to the DB with its initial state set. All the while, the isEnabled() call will return false (disabled) in all cases unless it actually reads the value in the DB (or its cache). This alleviates any sort of timing issue on first initialization if a different workload reaches the isEnabled() check before the switch is provisioned. Lastly, the Admin provisioner loop is built such that the first admin workload to perform the DB insert wins, and the others will fail gracefully\u200a\u2014\u200ano harm, no\u00a0foul.<\/p>\n<p>\/\/ ** will run once on startup<br \/>@PostConstruct<br \/>public void autoCreateCodeSwitches() {<br \/>  for (CodeSwitchEnum enumItem : CodeSwitchEnum.values()) {<br \/>    if (enumItem == CodeSwitchEnum.NONE)<br \/>      continue;<br \/>    if (Objects.nonNull(codeSwitchService.getCsInfo(enumItem)))<br \/>      continue; \/\/ already in DB, NEXT!<br \/>
\/\/ not found, create and insert in DB<br \/>    CodeSwitchInfo codeSwitchInfo = CodeSwitchInfo.builder()<br \/>        .name(enumItem.getValue())<br \/>        .metaval(enumItem.getMetaval())<br \/>        .isEnabled(enumItem.getInitialEnableState()).build();<br \/>    codeSwitchService.createCodeSwitch(codeSwitchInfo);<br \/>  }<br \/>  log.info(&quot;autoCreateCodeSwitches done.&quot;);<br \/>}<\/p>\n<p><strong>API Management<\/strong><\/p>\n<p>In addition to the workhorse of isEnabled(), we crafted a group of methods and APIs to manage the code switches, all protected following API security best practices.<\/p>\n<p>CodeSwitchInfo       createCodeSwitch(CodeSwitchInfo codeSwitchInfo);<br \/>CodeSwitchInfo       readCodeSwitchInfo(String name);<br \/>List&lt;CodeSwitchInfo&gt; readCodeSwitchInfos();<br \/>CodeSwitchInfo       updateInfoval(CodeSwitchInfo codeSwitchInfo);<br \/>void                 setEnableState(String name, boolean state);<br \/>void                 deleteCodeSwitch(String name);<\/p>\n<p>GET \/code-switch\/{name}    read code switch<br \/>GET \/code-switch\/switches  read list of ALL code switches<br \/>PUT \/code-switch  create code switch (pre-req: enum exists in code)<br \/>PUT \/code-switch\/{name}    update code switch infoval<br \/>DEL \/code-switch\/{name}           delete code switch<br \/>GET \/code-switch\/{name}\/enable    enable code switch<br \/>GET \/code-switch\/{name}\/disable   disable code switch<\/p>\n<h3>Usage and\u00a0Outcome<\/h3>\n<p>For our service, we have used this technique several times on areas of tech debt transition where there is the potential of \u201cservice breakage.\u201d We are careful not to use this as a means to enable or disable customer features, because this begins a slippery slope of customers being able to choose which features they want or not. That would make it a nightmare to maintain the code and would fragment customer feature sets. 
Again, the context for using this technique is a new service, or a small team of developers looking to evolve their service or keep paying down tech debt, that needs to be able to fall back\u00a0fast.<\/p>\n<p>We started with the default value of DISABLED for our first few code switches. Then, when the service owners were ready to enable the code switch, we did so with all of our monitoring instrumentation at the ready. Recall that I mentioned a good use of code switches is around areas that could break critical user flow functionality\u200a\u2014\u200aso we wanted to be\u00a0ready!<\/p>\n<p>For the act of changing the code switch, we have an Admin API that only allows service operators (with the right security and credentials) to authenticate and make the change to the enabled state. We immediately started to see different logging and monitoring metrics showing that the new code path (see path two above) was running. So far so good. After another 20 minutes, all appeared to be nominal, and we declared the production change a success, all the while knowing that, if there was any hint of concern, we could quickly revert to code path one. Which brings us back to the tenets of Generic Mitigations: <strong><em>minimize user impact and roll back\u00a0fast!<\/em><\/strong><\/p>\n<p>In closing, clearly there are many more sophisticated patterns out there that could be the next level in a service that is following an iterative approach of continuous improvement. 
Code Switches as described here offer a relatively simple yet useful tool to go about tech debt service improvements in a lightweight way with the option to revert\u00a0quickly.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/switch-it-up-42cebf95546d\">Switch It Up!<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/switch-it-up-42cebf95546d?source=rss----cfe1120185d3---4\">Read More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>It is vital that a microservice striving for high availability have at its disposal several choices on how to rollback changes when unforeseen production issues occur. A very interesting article comes to mind from O\u2019Reilly, Generic Mitigations, where the theme is to restore service functionality FIRST and FAST\u200a\u2014\u200aand THEN root cause the real fix as&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/09\/29\/switch-it-up\/\">Continue reading <span class=\"screen-reader-text\">Switch It Up!<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-479","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":781,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/06\/sre-weekly-issue-397\/","url_meta":{"origin":479,"position":0},"title":"SRE Weekly Issue #397","date":"November 6, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: Incident management platform FireHydrant is combining alerting and incident response in one ring-to-retro tool. 
Sign up for the early access waitlist and be the first to experience the power of alerting + incident response in one platform at last. https:\/\/firehydrant.com\/signals\/ Modern\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":662,"url":"https:\/\/fde.cat\/index.php\/2022\/12\/14\/how-salesforce-uses-immutable-infrastructure-in-hyperforce\/","url_meta":{"origin":479,"position":1},"title":"How Salesforce uses Immutable Infrastructure in Hyperforce","date":"December 14, 2022","format":false,"excerpt":"Credits go to: Armin Bahramshahry, Software Engineering Principal Architect @ Salesforce\u00a0&\u00a0Shan Appajodu, VP, Software Engineering for Developer Productivity Experiences @ Salesforce. To leverage the scale and agility of the world\u2019s leading public cloud platforms, our Technology and Products team at Salesforce has worked together over the past few years to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":897,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/","url_meta":{"origin":479,"position":2},"title":"AI Lab: The secrets to keeping machine learning engineers moving fast","date":"July 16, 2024","format":false,"excerpt":"The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. 
It allows us to continuously A\/B test common ML workflows \u2013 enabling proactive improvements and automatically preventing regressions on TTFB.\u00a0\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":528,"url":"https:\/\/fde.cat\/index.php\/2022\/01\/06\/managing-availability-in-service-based-deployments-with-continuous-testing\/","url_meta":{"origin":479,"position":3},"title":"Managing Availability in Service Based Deployments with Continuous Testing","date":"January 6, 2022","format":false,"excerpt":"The Problem At Salesforce, trust is our number one value. What this equates to is that our customers need to trust us; trust us to safeguard their data, trust that we will keep our services up and running, and trust that we will be there for them when they need\u00a0us.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":777,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/24\/automating-dead-code-cleanup\/","url_meta":{"origin":479,"position":4},"title":"Automating dead code cleanup","date":"October 24, 2023","format":false,"excerpt":"Meta\u2019s Systematic Code and Asset Removal Framework (SCARF) has a subsystem for identifying and removing dead code. SCARF combines static and dynamic analysis of programs to detect dead code from both a business and programming language perspective. 
SCARF automatically creates change requests that delete the dead code identified from the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":610,"url":"https:\/\/fde.cat\/index.php\/2022\/07\/20\/how-meta-and-the-security-industry-collaborate-to-secure-the-internet\/","url_meta":{"origin":479,"position":5},"title":"How Meta and the security industry collaborate to secure the internet","date":"July 20, 2022","format":false,"excerpt":"Bug hunting is hard and can sometimes go unnoticed across our industry. Building scalable bug detection methods across large codebases and open source libraries is an underappreciated yet critical effort every engineering company has to work through. Because the ideal outcome is that bugs are found and fixed before they\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/479","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=479"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/479\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=479"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=479"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=479"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}