{"id":494,"date":"2021-10-21T15:27:24","date_gmt":"2021-10-21T15:27:24","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2021\/10\/21\/connector-framework-a-generic-approach-to-crawl-activities-in-real-time\/"},"modified":"2021-10-21T15:27:24","modified_gmt":"2021-10-21T15:27:24","slug":"connector-framework-a-generic-approach-to-crawl-activities-in-real-time","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/10\/21\/connector-framework-a-generic-approach-to-crawl-activities-in-real-time\/","title":{"rendered":"Connector framework: A generic approach to crawl activities in real-time"},"content":{"rendered":"<p><em>Authors: Jayanth Parayil Kumarji, Evan Jiang, Kevin Terusaki, Zhidong Ke, Heng Zhang, Jeff Lowe, Yifeng Liu, Priyadarshini Mitra<\/em><\/p>\n<h3>Introduction<\/h3>\n<p><a href=\"https:\/\/www.salesforce.com\/products\/sales-cloud\/overview\/\">Sales Cloud<\/a> empowers our customers to make quick and well-informed decisions, with all of the tools they need to manage their selling process. The features we work on in the Activity Platform team, spanning Sales Cloud Einstein, Einstein Analytics, Einstein Conversation Insights, and Salesforce Inbox, rely on a real-time supply of activity data. Building a robust and reliable data pipeline can be complex because it\u2019s based on the user\u2019s business rules. In an effort to speed development, we\u2019ve created a framework for our business use case that is easily extensible to create a custom real-time ingestion pipeline. In this blog post, we\u2019ll explore its fundamentals, take a look at our design choices framework, and examine the high-level architecture of this framework that makes it versatile.<\/p>\n<h3>Background &amp; Terminology<\/h3>\n<p>Activities in the context of this topic are of different types, including email, meeting, video call, voice call, etc. 
The providers\/vendors of such activities have a few distinct characteristics:<\/p>\n<ul>\n<li>They provide a way to fetch a given activity using a unique identifier.<\/li>\n<li>They let you subscribe to any new activity or changes to an existing activity.<\/li>\n<li>They allow some form of range\u00a0query.<\/li>\n<\/ul>\n<p>Some of the popular vendors\/providers we support include Gmail, Gcal, O365, and Exchange.<\/p>\n<p>For the scope of this discussion, we\u2019ll focus on email activity. Each mailbox to be crawled is tracked internally as a first-class <strong><em>Datasource<\/em><\/strong> object. Each datasource entry tracks all of the configuration required to communicate with the vendor, including the subscription lease metadata. Customers have control over the Datasources they provide and can disconnect\/reconnect them on demand. Thus, Datasources can be in different states based on various customer actions, so managing the lifecycle of a datasource and respecting its state becomes part of the crawling\u00a0system.<\/p>\n<h3>Breaking up the\u00a0problem<\/h3>\n<p>In a nutshell, we need to capture any change to data we are interested in within an external system and propagate it into our own system. Let\u2019s dive into the problem space at a high level and, at each level, define the pieces needed to formulate our solution:<\/p>\n<ul>\n<li>Systems<\/li>\n<li>Components<\/li>\n<li>Subcomponents<\/li>\n<\/ul>\n<h4>Systems<\/h4>\n<p>At a high level, the process of crawling an activity can be broken into two major\u00a0systems.<\/p>\n<p><strong>Webhook<\/strong>\u200a\u2014\u200aThis is a public-facing microservice that accepts incoming notification events and queues each notification for the downstream component to consume. 
The webhook is a fairly simple microservice that can be placed behind a load balancer and scaled horizontally.<\/p>\n<p><strong>Processing system<\/strong>\u200a\u2014\u200aA processing unit that translates the received notifications into their corresponding activities. This system manages the lifecycle of the datasource and tracks any operational metadata.<\/p>\n<h4>Components<\/h4>\n<p>Let\u2019s expand on the processing system. It can be broken down into multiple components, each with a specific responsibility.<\/p>\n<p><strong>Fanout<\/strong>\u200a\u2014\u200aThis component accepts a given notification and fans it out into a number of actionable internal notifications. The fanout queues messages for the collector to\u00a0consume.<\/p>\n<p><strong>Collector<\/strong>\u200a\u2014\u200aThe collector is responsible for gathering an activity based on the translated internal notification generated by the fanout, such as fetching a new email or gathering the RSVP list of a meeting. This component is <a href=\"https:\/\/www.pcmag.com\/encyclopedia\/term\/io-intensive\">I\/O intensive<\/a>, and an unreachable external service can have a cascading effect. To mitigate this, the component works with at-most-once semantics. The collector is also responsible for recording all messages it has crawled in a given timeframe. It does this by adding each successfully collected activity ID to a set data structure for a given time\u00a0bucket.<\/p>\n<p><strong>Verifier<\/strong>\u200a\u2014\u200aThe verifier\u2019s job is to periodically check for missed data and crawl it. Contrary to the collector, which relies on a push-style operation model, the verifier follows a pull-style model. 
This helps us account for any events we miss due to transient errors, or cases where we never received the notification at all.<\/p>\n<p><strong>Refresher<\/strong>\u200a\u2014\u200aThis component manages the lifecycle of the datasource by preemptively refreshing its auth tokens and refreshing the subscription lease if\u00a0needed.<\/p>\n<p><strong>Onboarder<\/strong>\u200a\u2014\u200aThis special component is used specifically to crawl historical activities. Ideally, the onboarder is triggered the first time the datasource is connected to gather all historical data. Once the historical data is crawled, the datasource is enabled for real-time crawling.<\/p>\n<p>Apart from its component-specific responsibility, each component is also responsible for handling errors and updating the state of the datasource if it observes that it is invalid. This allows us to weed out notifications\/operations that correspond to invalid datasources. To avoid messages being dropped as they pass across component boundaries, we recommend using some form of persistent queuing system. Each component, after acting on a message\/operation for a datasource, checkpoints with what we call the <strong>component watermark<\/strong>. The component watermark gives us a glimpse of the operational data for a given datasource and helps us track failures at different stages of the pipeline. Each component is further responsible for reporting its metrics and response\u00a0time.<\/p>\n<p>If we implement each of these components on any stream\/micro-batch framework, we\u2019ll have a functioning pipeline capable of capturing any changes in activities from the external provider.<\/p>\n<h4>Subcomponents<\/h4>\n<p>Implementing these components allows us to stream one type of activity from a given vendor. This works well when we have a single vendor and a specific activity type. At Salesforce, however, we need to support different types of activities across various vendors. 
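To make the collector\u2019s time-bucket recording and the verifier\u2019s pull-style check more concrete, here is a minimal Python sketch. The names (`Recorder`, `bucket_for`, `BUCKET_SECONDS`) and the one-hour bucket size are illustrative assumptions, not the framework\u2019s actual API:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # assumed one-hour time buckets

def bucket_for(ts: datetime) -> int:
    # Map an activity timestamp (treated as UTC) to its time-bucket key.
    return int(ts.replace(tzinfo=timezone.utc).timestamp()) // BUCKET_SECONDS

class Recorder:
    """In-memory stand-in for the per-datasource set data structure."""

    def __init__(self) -> None:
        self.buckets: dict[int, set[str]] = {}

    def record(self, activity_id: str, ts: datetime) -> None:
        # Collector side: note each successfully collected activity ID.
        self.buckets.setdefault(bucket_for(ts), set()).add(activity_id)

    def missed(self, bucket: int, vendor_ids: set[str]) -> set[str]:
        # Verifier side: IDs the vendor's range query returned for this
        # bucket that were never collected, i.e. candidates to re-crawl.
        return vendor_ids - self.buckets.get(bucket, set())
```

The verifier would obtain `vendor_ids` from the provider\u2019s range query over the bucket\u2019s time window and re-crawl whatever `missed` returns.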
To make this more extensible, we split each component into reusable subcomponents. Imagine each subcomponent as a Lego block; repackaging them in different arrangements allows you to create a component. This approach makes it easy to support new types of activities, as well as activities from different vendors.<\/p>\n<p>Let\u2019s dig into the building blocks of each component and their\u00a0purposes:<\/p>\n<p><strong>Spouts<\/strong>: This term is borrowed from the <a href=\"https:\/\/storm.apache.org\/\">Apache Storm<\/a> framework. A spout represents the source of a data stream and is essentially a wrapper around the underlying dataset, e.g. a <a href=\"https:\/\/storm.apache.org\/releases\/current\/javadocs\/org\/apache\/storm\/kafka\/spout\/KafkaSpout.html\">KafkaSpout<\/a> that reads data from Kafka. This subcomponent allows you to read data into the component.<\/p>\n<p><strong>Readers<\/strong>: This construct helps interface with a specific external activity provider. We need a specific reader for each vendor-activity combination.<\/p>\n<p><strong>Mappers<\/strong>: These are responsible for mapping external constructs to internal objects, e.g. mapping an email\u2019s creation time to our internal time\u00a0bucket.<\/p>\n<p><strong>Emitter<\/strong>: This subcomponent allows us to transfer and publish data from a component. Similar to the reader subcomponent, the emitter can come in different flavors, such as a <a href=\"https:\/\/docs.confluent.io\/platform\/current\/clients\/producer.html\">KafkaEmitter<\/a> that lets components queue messages to a specific Kafka\u00a0topic.<\/p>\n<p><strong>Recorder<\/strong>: This unique subcomponent is used to track individual actions\/operations\/data for a given component. 
As an example, the collector component uses a recorder to track the list of activities crawled for a datasource within a specific time\u00a0frame.<\/p>\n<p><strong>Transformers<\/strong>: Unlike mappers, which transform an external representation to an internal one, transformers convert one internal representation to\u00a0another.<\/p>\n<p><strong>Checkpointer<\/strong>: This allows each component to checkpoint on completion of its action on a datasource. The checkpointer is responsible for updating the component watermark for a datasource, and also for updating the datasource state on consecutive failures.<\/p>\n<p>Armed with the above information, we can build most of our components by gluing these subcomponents together in various arrangements.<\/p>\n<p>Each subcomponent can come in various flavors based on the vendor and activity type. Tackled correctly, each subcomponent can be made further configurable to avoid re-engineering.<\/p>\n<h3>Conclusion<\/h3>\n<p>This framework has been the cornerstone of how we capture activities and data of a similar nature. With this generic framework, we are able to significantly reduce development time when deploying a new activity capture pipeline for a vendor. The framework not only gives us a generic approach to capturing activities but also unifies the way we track and monitor such systems. This makes it easier to catch issues early and to alert based on component-specific rules. Although we mainly focused on activity data here, this is by no means the limit. As of this writing, we have built robust real-time data pipelines capable of capturing email, calendar, contact, and video call information from different vendors. We hope to improve this framework through our continued learning about new external systems, observations of different ingested data shapes, and examination of the operational challenges we face over time. 
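As a rough illustration of gluing these subcomponents together, here is a minimal Python sketch of a component as a pipeline of pluggable callables. All names here (`Component`, `run_once`, the field names) are hypothetical, not the framework\u2019s real interfaces:

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class Component:
    """A component assembled from Lego-block subcomponents (illustrative)."""
    spout: Callable[[], Iterable[dict]]   # source of internal notifications
    reader: Callable[[dict], Any]         # fetch the activity from the vendor
    mapper: Callable[[Any], dict]         # external shape -> internal shape
    emitter: Callable[[dict], None]       # publish the result downstream
    checkpointer: Callable[[str], None]   # update the component watermark

    def run_once(self) -> None:
        # One pass: read, fetch, map, emit, then checkpoint per datasource.
        for note in self.spout():
            internal = self.mapper(self.reader(note))
            self.emitter(internal)
            self.checkpointer(note["datasource_id"])
```

Swapping in a different reader or emitter (say, a Kafka-backed one) yields a collector for another vendor or activity type without touching the rest of the pipeline.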
If you\u2019d like to use this framework, let us know on Twitter at <a href=\"https:\/\/medium.com\/u\/41ea9b1cdc2b\">@SalesforceEng<\/a>! Based on community interest, we may open source this framework.<\/p>\n<p><a href=\"https:\/\/engineering.salesforce.com\/connector-framework-a-generic-approach-to-crawl-activities-in-real-time-4e483f31d698\">Connector framework: A generic approach to crawl activities in real-time<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>","protected":false},"excerpt":{"rendered":"<p>Authors: Jayanth Parayil Kumarji, Evan Jiang, Kevin Terusaki, Zhidong Ke, Heng Zhang, Jeff Lowe, Yifeng Liu, Priyadarshini Mitra Introduction Sales Cloud empowers our customers to make quick and well-informed decisions, with all of the tools they need to manage their selling process. 
The features we work on in the Activity Platform team, spanning Sales Cloud&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/10\/21\/connector-framework-a-generic-approach-to-crawl-activities-in-real-time\/\">Continue reading <span class=\"screen-reader-text\">Connector framework: A generic approach to crawl activities in real-time<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-494","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":""}