{"id":753,"date":"2023-08-29T16:00:30","date_gmt":"2023-08-29T16:00:30","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/08\/29\/scheduling-jupyter-notebooks-at-meta\/"},"modified":"2023-08-29T16:00:30","modified_gmt":"2023-08-29T16:00:30","slug":"scheduling-jupyter-notebooks-at-meta","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/08\/29\/scheduling-jupyter-notebooks-at-meta\/","title":{"rendered":"Scheduling Jupyter Notebooks at Meta"},"content":{"rendered":"<p><span>At Meta, <\/span><a href=\"https:\/\/developers.facebook.com\/blog\/post\/2021\/09\/20\/eli5-bento-interactive-notebook-empowers-development-collaboration-best-practices\/\" target=\"_blank\" rel=\"noopener\"><span>Bento<\/span><\/a><span> is our internal <\/span><a href=\"https:\/\/jupyter.org\/\" target=\"_blank\" rel=\"noopener\"><span>Jupyter<\/span><\/a><span> notebooks platform, used by many internal users. Notebooks are also widely used for creating reports and workflows (for example, performing <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Extract,_transform,_load\" target=\"_blank\" rel=\"noopener\"><span>data ETL<\/span><\/a><span>) that need to be repeated at certain intervals. Users with such notebooks would have to remember to manually run their notebooks at the required cadence \u2013 a process that is easy to forget and that does not scale as the number of notebooks grows.<\/span><\/p>\n<p><span>To address this problem, we invested in building a scheduled notebooks infrastructure that fits in seamlessly with the rest of the internal tooling available at Meta. Investing in infrastructure helps ensure that privacy is inherent in everything we build. 
It enables us to continue building innovative, valuable solutions in a privacy-safe way.\u00a0<\/span><\/p>\n<p><span>The ability to transparently answer questions about how data flows through Meta systems, both for data privacy and for regulatory compliance, differentiates our scheduled notebooks implementation from the rest of the industry.<\/span><\/p>\n<p><span>In this post, we\u2019ll explain how we married Bento with our batch ETL pipeline framework called <\/span><a href=\"https:\/\/www.youtube.com\/watch?v=4T-MCYWrrOw\" target=\"_blank\" rel=\"noopener\"><span>Dataswarm<\/span><\/a><span> (think <\/span><a href=\"https:\/\/airflow.apache.org\/\" target=\"_blank\" rel=\"noopener\"><span>Apache Airflow<\/span><\/a><span>) in a privacy- and lineage-aware manner.<\/span><\/p>\n<h2><span>The challenge around doing scheduled notebooks at Meta<\/span><\/h2>\n<p><span>At Meta, we\u2019re committed to improving confidence in production by performing <\/span><a href=\"https:\/\/engineering.fb.com\/2022\/11\/30\/data-infrastructure\/static-analysis-sql-queries\/\" target=\"_blank\" rel=\"noopener\"><span>static analysis<\/span><\/a><span> on scheduled artifacts and maintaining coherent narratives around dataflows by leveraging transparent <\/span><a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/core-concepts\/operators.html\" target=\"_blank\" rel=\"noopener\"><span>Dataswarm Operators<\/span><\/a><span> and data annotations. 
Notebooks pose a special challenge because:<\/span><\/p>\n<ul>\n<li><span>Due to dynamic code content (think table names created via f-strings), static analysis won\u2019t work, making it harder to understand data lineage.<\/span><\/li>\n<li><span>Since notebooks can contain arbitrary code, their execution in production is considered \u201copaque\u201d because data lineage cannot be determined, validated, or recorded.<\/span><\/li>\n<li><span>Scheduled notebooks are considered to be on the production side of the production-development barrier. Before anything runs in production, it needs to be reviewed, and reviewing notebook code is non-trivial.<\/span><\/li>\n<\/ul>\n<p><span>These three considerations shaped our design decisions. In particular, we limited notebooks that can be scheduled to those primarily performing ETL and those performing data transformations and displaying visualizations. Notebooks with any other side effects are currently out of scope and are not eligible to be scheduled.<\/span><\/p>\n<h2><span>How scheduled notebooks work at Meta<\/span><\/h2>\n<p><span>There are three main components for supporting scheduled notebooks:<\/span><\/p>\n<ul>\n<li><span>The UI for setting up a schedule and creating a diff (Meta\u2019s pull request equivalent) that needs to be reviewed before the notebook and associated Dataswarm pipeline get checked into source control.<\/span><\/li>\n<li><span>The debugging interface once a notebook has been scheduled.<\/span><\/li>\n<li><span>The integration point (a custom <\/span><a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/core-concepts\/operators.html\" target=\"_blank\" rel=\"noopener\"><span>Operator<\/span><\/a><span>) with Meta\u2019s internal scheduler to actually run the notebook. We call this <\/span><span>BentoOperator<\/span><span>.<\/span><\/li>\n<\/ul>\n<h2><span>How BentoOperator works<\/span><\/h2>\n<p><span>To address most of the concerns highlighted above, we <\/span>perform the notebook execution in a container without access to the network. <span>We also leverage <\/span><span><span>input<\/span><span> &amp; <\/span><span>output<\/span><\/span><span> data annotations to show the flow of data.<\/span><\/p>\n<p>The overall design for BentoOperator.<\/p>\n<p><span>For ETL, we fetch data and write it out in a novel way:<\/span><\/p>\n<ul>\n<li><span>Supported notebooks perform data fetches in a structured manner via custom cells that we\u2019ve built. An example of this is the SQL cell. When <\/span><span>BentoOperator<\/span><span> runs, the first step involves parsing the metadata associated with these cells, fetching the data using transparent Dataswarm Operators, and persisting it in local CSV files on the ephemeral remote hosts.<\/span><\/li>\n<li><span>Instances of these custom cells are then replaced with a call to <\/span><span><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.read_csv.html\" target=\"_blank\" rel=\"noopener\"><span>pandas.read_csv()<\/span><\/a><\/span><span> to load that data in the notebook, unlocking the ability to execute the notebook without any access to the network.<\/span><\/li>\n<li><span>Data writes also leverage a custom cell, which we replace with a call to <\/span><span><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.to_csv.html\" target=\"_blank\" rel=\"noopener\"><span>pandas.DataFrame.to_csv()<\/span><\/a><\/span><span> to persist to a local CSV file, which we then process after the actual notebook execution is complete, uploading the data to the warehouse using transparent Dataswarm Operators.<\/span><\/li>\n<li><span>After this step, the temporary CSV files are garbage-collected; the resulting notebook version with outputs is uploaded, and the ephemeral execution host is deallocated.<\/span><\/li>\n<\/ul>\n<p>Custom SQL cell supported for scheduled notebooks.<br \/>\nStructured custom cell for data uploads.<\/p>\n<h2><span>Our approach to privacy with BentoOperator<\/span><\/h2>\n<p><span>We have integrated <\/span><span>BentoOperator<\/span><span> within Meta\u2019s data purpose framework to ensure that data is used only for the purpose for which it was intended. This framework ensures that the data usage purpose is respected as data flows and transmutes across Meta\u2019s stack. As part of scheduling a notebook, the user supplies a \u201cpurpose policy zone,\u201d which serves as the integration point with the data purpose framework.<\/span><\/p>\n<h2><span>Overall user workflow<\/span><\/h2>\n<p><span>Let\u2019s now explore the workflow for scheduling a notebook:<\/span><\/p>\n<p><span>We\u2019ve exposed the scheduling entry point directly from the notebook header, so all users have to do is hit a button to get started.<\/span><\/p>\n<p><span>The first step in the workflow is setting up some parameters that will be used for automatically generating the pipeline for the schedule.<\/span><\/p>\n<p><span>The next step involves previewing the generated pipeline before a <\/span><a href=\"https:\/\/engineering.fb.com\/2023\/06\/27\/developer-tools\/meta-developer-tools-open-source\/\"><span>Phabricator<\/span><\/a><span> (Meta\u2019s diff review tool) diff is created.<\/span><\/p>\n<p><span>In addition to the pipeline code for running the notebook, the notebook itself is also checked into source control so it can be reviewed. The results of attempting to run the notebook in a scheduled setup are also included in the test plan.<\/span><\/p>\n<p><span>Once the diff has been reviewed and landed, the schedule starts running the next day. If the notebook execution fails for any reason, the schedule owner is automatically notified. 
We\u2019ve also built a context pane extension directly in Bento to help with debugging notebook runs.<\/span><\/p>\n<h2><span>What\u2019s next for scheduled notebooks<\/span><\/h2>\n<p><span>While we\u2019ve addressed the challenge of supporting scheduled notebooks in a privacy-aware manner, the notebooks that are in scope for scheduling are limited to those performing ETL or data analysis with no other side effects. This is only a fraction of the notebooks that users eventually want to schedule. To increase the number of use cases, we\u2019ll be investing in supporting other transparent data sources in addition to the SQL cell.<\/span><\/p>\n<p><span>We have also begun work on supporting parameterized notebooks in a scheduled setup. The idea is that, instead of checking <\/span><span>many<\/span><span> notebooks that differ only by a few variables into source control, we check in a single notebook and inject the differentiating parameters at runtime.<\/span><\/p>\n<p><span>Lastly, we\u2019ll be working on event-based scheduling (in addition to the time-based approach we have here) so that a scheduled notebook can also wait for predefined events before running. 
This would include, for example, the ability to wait until all the data sources the notebook depends on have landed before notebook execution begins.<\/span><\/p>\n<h2><span>Acknowledgments<\/span><\/h2>\n<p><span>Some of the approaches we took were directly inspired by the work done on <\/span><a href=\"https:\/\/papermill.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noopener\"><span>Papermill<\/span><\/a><span>.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2023\/08\/29\/security\/scheduling-jupyter-notebooks-meta\/\">Scheduling Jupyter Notebooks at Meta<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users. Notebooks are also being used widely for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. 
Users with such notebooks would have to remember to manually run their notebooks at the&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/08\/29\/scheduling-jupyter-notebooks-at-meta\/\">Continue reading <span class=\"screen-reader-text\">Scheduling Jupyter Notebooks at Meta<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-753","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/753","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=753"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/753\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=753"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=753"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=753"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}