{"id":658,"date":"2022-11-30T17:00:10","date_gmt":"2022-11-30T17:00:10","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/11\/30\/enabling-static-analysis-of-sql-queries-at-meta\/"},"modified":"2022-11-30T17:00:10","modified_gmt":"2022-11-30T17:00:10","slug":"enabling-static-analysis-of-sql-queries-at-meta","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/11\/30\/enabling-static-analysis-of-sql-queries-at-meta\/","title":{"rendered":"Enabling static analysis of SQL queries at Meta"},"content":{"rendered":"<p><span>UPM is our internal standalone library to perform <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Static_program_analysis\"><span>static analysis<\/span><\/a><span> of SQL code and enhance SQL authoring.\u00a0<\/span><br \/>\n<span>UPM takes <\/span><a href=\"https:\/\/engineering.fb.com\/2022\/04\/26\/developer-tools\/sql-notebooks\/\"><span>SQL<\/span><\/a><span> code as input and represents it as a data structure called a semantic tree.<\/span><br \/>\n<span>Infrastructure teams at Meta leverage UPM to build SQL linters, catch user mistakes in SQL code, and perform data lineage analysis at scale.<\/span><\/p>\n<p><span>Executing SQL queries against our data warehouse is important to the workflows of many engineers and data scientists at Meta for analytics and monitoring use cases, either as part of recurring data pipelines or for ad-hoc data exploration.\u00a0<\/span><\/p>\n<p><span>While SQL is extremely powerful and very popular among our engineers, we\u2019ve also faced some challenges over the years, namely:\u00a0<\/span><\/p>\n<p>A need for static analysis capabilities: <span>In a growing number of use cases at Meta, we must understand programmatically what happens in SQL queries before they are executed against our query engines \u2014 a task called static analysis.\u00a0 <\/span><span>These use cases range from performance linters (suggesting query optimizations that query engines cannot perform 
automatically) to data lineage analysis (tracing how data flows from one table to another). <\/span>This was hard for us to do for two reasons: First, while query engines internally have some capabilities to analyze a SQL query in order to execute it, this query analysis component is typically deeply embedded inside the query engine\u2019s code. It is not easy to extend, and it is not intended for consumption by other infrastructure teams. Second, each query engine has its own analysis logic, specific to its own SQL dialect; as a result, a team that wants to build a piece of analysis for SQL queries would have to reimplement it from scratch inside each SQL query engine.<\/p>\n<p>A limiting type system:<span> Initially, we used only the fixed set of<\/span><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/LanguageManual+Types\"> <span>built-in Hive data types<\/span><\/a><span> (<\/span><span>string<\/span><span>, <\/span><span>integer<\/span><span>, <\/span><span>boolean<\/span><span>, etc.<\/span><span>) <\/span><span>to describe table columns in our data warehouse<\/span><span>. <\/span><span>As our warehouse grew more complex, this set of types became insufficient, as it left us unable to catch common categories of user errors, such as unit errors (imagine making a UNION between two tables, both of which contain a column called <\/span><span>timestamp<\/span><span>, but one is encoded in milliseconds and the other in nanoseconds), or ID comparison errors (imagine a JOIN between two tables, each with a column called<\/span> <span>user_id<\/span><span><span> \u2014 but, in fact, those IDs are issued by different <\/span><\/span><span>systems and therefore cannot be compared).<\/span><\/p>\n<h2><span>How UPM works<\/span><\/h2>\n<p><span>To address these challenges, we have built UPM (Unified Programming Model). 
UPM takes in an SQL query as input and represents it as a hierarchical data structure called a semantic tree.<\/span><\/p>\n<p><span>\u00a0<\/span><span>For example, if you pass in this query to UPM:<\/span><\/p>\n<p>SELECT<br \/>\nCOUNT(DISTINCT user_id) AS n_users<br \/>\nFROM login_events<\/p>\n<p><span>UPM will return this semantic tree:<\/span><\/p>\n<p>SelectQuery(<br \/>\n \titems=[<br \/>\n \tSelectItem(<br \/>\n       \tname=&#8221;n_users&#8221;,<br \/>\n       \ttype=upm.Integer,<br \/>\n       \tvalue=CallExpression(<br \/>\n            \tfunction=upm.builtin.COUNT_DISTINCT,<br \/>\n                arguments=[ColumnRef(name=&#8221;user_id&#8221;, parent=Table(&#8220;login_events&#8221;))],<br \/>\n       \t),<br \/>\n \t)<br \/>\n    ],<br \/>\n    parent=Table(&#8220;login_events&#8221;),<br \/>\n)<\/p>\n<p><span>\u00a0<\/span><span>Other tools can then use this semantic tree for different use cases, such as:<\/span><\/p>\n<p>Static analysis:<span> A tool can <\/span>inspect <span>the semantic tree and then output diagnostics or warnings about the query (such as a SQL linter).<\/span><br \/>\nQuery rewriting:<span> A tool can <\/span>modify <span>the semantic tree to rewrite the query. <\/span><br \/>\nQuery execution: <span>UPM can act as a pluggable SQL front end, meaning that a database engine or query engine can use a UPM semantic tree directly to generate and execute a query plan. (The word <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Compiler#Front_end\"><span>front end<\/span><\/a> <span>in this context is borrowed from the world of compilers; the front end is the part of a compiler that converts higher-level code into an intermediate representation that will ultimately be used to generate an executable program). 
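The static-analysis use case can be illustrated with a short sketch. The classes below are hypothetical stand-ins that mirror the semantic tree shown above (UPM's real tree API is internal to Meta); the pass walks the tree and collects every column a query reads, the kind of information a linter or lineage tool would start from.

```python
# Hypothetical sketch: these classes mirror the semantic tree shown above,
# but UPM's real tree API is internal to Meta.
from dataclasses import dataclass

@dataclass
class Table:
    name: str

@dataclass
class ColumnRef:
    name: str
    parent: Table

@dataclass
class CallExpression:
    function: str
    arguments: list

@dataclass
class SelectItem:
    name: str
    type: str
    value: object

@dataclass
class SelectQuery:
    items: list
    parent: Table

def referenced_columns(node):
    # Static-analysis pass: recursively collect every table.column a query reads.
    if isinstance(node, ColumnRef):
        yield f'{node.parent.name}.{node.name}'
    elif isinstance(node, CallExpression):
        for arg in node.arguments:
            yield from referenced_columns(arg)
    elif isinstance(node, SelectItem):
        yield from referenced_columns(node.value)
    elif isinstance(node, SelectQuery):
        for item in node.items:
            yield from referenced_columns(item)

# The tree for: SELECT COUNT(DISTINCT user_id) AS n_users FROM login_events
tree = SelectQuery(
    items=[SelectItem(
        name='n_users',
        type='upm.Integer',
        value=CallExpression(
            function='COUNT_DISTINCT',
            arguments=[ColumnRef(name='user_id', parent=Table('login_events'))],
        ),
    )],
    parent=Table('login_events'),
)
print(sorted(set(referenced_columns(tree))))  # ['login_events.user_id']
```

A real analysis tool would handle many more node kinds (joins, subqueries, CTEs), but the shape of the traversal stays the same.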
Alternatively, UPM can render the semantic tree back into a target SQL dialect (as a string) and pass that to the query engine.<\/span><\/p>\n<h2><span>A unified SQL language front end<\/span><\/h2>\n<p><span>UPM allows us to provide a single language front end to our SQL users, so they need to learn only one dialect (a superset of the<\/span><a href=\"https:\/\/prestodb.io\/docs\/current\/\"> <span>Presto<\/span><\/a><span> SQL dialect) \u2014 whether their target engine is Presto, Spark, or XStream, our in-house stream processing service.<\/span><\/p>\n<p><span>This unification also benefits our data infrastructure teams: those that own SQL static analysis or rewriting tools can use UPM semantic trees as a standard interop format, without worrying about parsing, analysis, or integration with different SQL query engines and SQL dialects. Similarly, much like <\/span><a href=\"https:\/\/engineering.fb.com\/2022\/08\/31\/open-source\/velox\/\"><span>Velox<\/span><\/a><span> can act as a pluggable <\/span><span>execution engine<\/span><span> for data management systems, UPM can act as a pluggable <\/span><span>language front end<\/span><span>, saving teams the effort of maintaining their own SQL front end.<\/span><\/p>\n\n<h2><span>Enhanced type-checking<\/span><\/h2>\n<p><span>UPM also allows us to provide enhanced type-checking<\/span> <span>of SQL queries.<\/span><\/p>\n<p><span>\u00a0<\/span><span>In our warehouse, each table column is assigned a \u201cphysical\u201d type from a fixed list, such as <\/span><span>integer<\/span><span> or <\/span><span>string<\/span><span>. Additionally, each column can have an optional user-defined type; while it does not affect how the data is encoded on disk, this type can supply semantic information (e.g., Email, TimestampMilliseconds, or UserID). 
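As a sketch of how such annotations can be used, the snippet below models a column with both a physical type and an optional user-defined type, and rejects a UNION whose columns agree physically but not semantically. The class and function names are hypothetical, not UPM's actual API.

```python
# Hypothetical sketch of the user-defined type check described above;
# UPM's actual typechecker and schema annotations are internal to Meta.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Column:
    name: str
    physical_type: str                   # how the data is encoded on disk
    semantic_type: Optional[str] = None  # optional user-defined type

def check_union(left, right):
    # Reject a UNION whose columns agree physically but not semantically.
    errors = []
    for l, r in zip(left, right):
        if l.physical_type != r.physical_type:
            errors.append(f'{l.name}: physical type mismatch')
        elif l.semantic_type and r.semantic_type and l.semantic_type != r.semantic_type:
            errors.append(
                f'{l.name}: cannot UNION {l.semantic_type} with {r.semantic_type}')
    return errors

mobile = [Column('timestamp', 'bigint', 'TimestampMilliseconds')]
desktop = [Column('timestamp', 'bigint', 'TimestampNanoseconds')]
print(check_union(mobile, desktop))
# ['timestamp: cannot UNION TimestampMilliseconds with TimestampNanoseconds']
```

Because both columns are plain bigints on disk, a checker that only sees physical types would accept this UNION; the semantic annotation is what makes the mismatch detectable.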
UPM can take advantage of these user-defined types to improve static type-checking of SQL queries.<\/span><\/p>\n<p><span>\u00a0<\/span><span>For example, an SQL query author might want to UNION data from two tables that contain information about different login events:<\/span><\/p>\n\n<p><span>\u00a0<\/span><span>In such a query, the author is trying to combine timestamps in milliseconds from the table<\/span> <span>user_login_events_mobile<\/span> <span>with timestamps in nanoseconds from the table<\/span> <span>user_login_events_desktop<\/span> <span>\u2014 an understandable mistake, as the two columns have the same name. But because the tables\u2019 schemas have been annotated with user-defined types, UPM\u2019s typechecker catches the error before the query reaches the query engine; it then notifies the author in their code editor. Without this check, the query would have completed successfully, and the author might not have noticed the mistake until much later.<\/span><\/p>\n<h2><span>Column-level data lineage<\/span><\/h2>\n<p><span>Data lineage \u2014 understanding how data flows within our warehouse and through to consumption surfaces \u2014 is a foundational piece of our data infrastructure. It enables us to answer data quality questions (e.g., \u201cThis data looks incorrect; where is it coming from?\u201d and \u201cData in this table was corrupted; which downstream data assets were impacted?\u201d). It also helps with data refactoring (\u201cIs this table safe to delete? Is anyone still depending on it?\u201d).\u00a0<\/span><\/p>\n<p><span>\u00a0<\/span><span>To help us answer those critical questions, our data lineage team has built a query analysis tool that takes UPM semantic trees as input. The tool examines all recurring SQL queries to build a column-level data lineage graph across our entire warehouse. 
For example, given this query:<\/span><\/p>\n<p>INSERT INTO user_logins_daily_agg<br \/>\nSELECT<br \/>\n   DATE(login_timestamp) AS day,<br \/>\n   COUNT(DISTINCT user_id) AS n_users<br \/>\nFROM user_login_events<br \/>\nGROUP BY 1<\/p>\n<p><span>Our UPM-powered column lineage analysis would deduce these edges:<\/span><\/p>\n<p>[{<br \/>\n   from: \u201cuser_login_events.login_timestamp\u201d,<br \/>\n   to: \u201cuser_logins_daily_agg.day\u201d,<br \/>\n   transform: \u201cDATE\u201d<br \/>\n},<br \/>\n{<br \/>\n   from: \u201cuser_login_events.user_id\u201d,<br \/>\n   to: \u201cuser_logins_daily_agg.n_users\u201d,<br \/>\n   transform: \u201cCOUNT_DISTINCT\u201d<br \/>\n}]<\/p>\n<p><span>By putting this information together for every query executed against our data warehouse each day, the tool shows us a global view of the full column-level data lineage graph.<\/span><\/p>\n<h2><span>What\u2019s next for UPM<\/span><\/h2>\n<p><span>We look forward to more exciting work as we continue to unlock UPM\u2019s full potential at Meta. Eventually, we hope all Meta warehouse tables will be annotated with user-defined types and other metadata, and that enhanced type-checking will be strictly enforced in every authoring surface. Most tables in our Hive warehouse already leverage user-defined types, but we are rolling out stricter type-checking rules gradually, to facilitate the migration of existing SQL pipelines.<\/span><\/p>\n<p><span>We have already integrated UPM into the main surfaces where Meta\u2019s developers write SQL, and our long-term goal is for UPM to become Meta\u2019s unified SQL front end: deeply integrated into all our query engines, exposing a single SQL dialect to our developers. 
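The lineage edges shown earlier can be derived mechanically from a query's analyzed form. The sketch below uses hypothetical structures (not UPM's real semantic tree) to map each output column of an INSERT ... SELECT to its source column and transform.

```python
# Rough sketch with hypothetical structures (not UPM's real semantic tree):
# derive column-level lineage edges for an INSERT ... SELECT query.
from dataclasses import dataclass

@dataclass
class Expr:
    transform: str   # e.g. 'DATE' or 'COUNT_DISTINCT'
    source: str      # input column, written as 'table.column'

@dataclass
class InsertSelect:
    target_table: str
    outputs: dict    # output column name -> Expr it is computed from

def lineage_edges(query):
    # One edge per output column: source column -> target column, plus transform.
    return [
        {'from': expr.source,
         'to': f'{query.target_table}.{out}',
         'transform': expr.transform}
        for out, expr in query.outputs.items()
    ]

# Mirrors the user_logins_daily_agg example above.
query = InsertSelect(
    target_table='user_logins_daily_agg',
    outputs={
        'day': Expr('DATE', 'user_login_events.login_timestamp'),
        'n_users': Expr('COUNT_DISTINCT', 'user_login_events.user_id'),
    },
)
for edge in lineage_edges(query):
    print(edge)
```

Aggregating such edges over every recurring query is what yields the warehouse-wide lineage graph described above.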
We also intend to iterate on the ergonomics of this unified SQL dialect (for example, by allowing trailing commas in <\/span><span>SELECT <\/span><span>clauses and by supporting syntax constructs like<\/span> <span>SELECT * EXCEPT &lt;some_columns&gt;<\/span><span>, which already exist in some SQL dialects) and to ultimately raise the level of abstraction at which people write their queries.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/11\/30\/data-infrastructure\/static-analysis-sql-queries\/\">Enabling static analysis of SQL queries at Meta<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>UPM is our internal standalone library to perform static analysis of SQL code and enhance SQL authoring.\u00a0 UPM takes SQL code as input and represents it as a data structure called a semantic tree. Infrastructure teams at Meta leverage UPM to build SQL linters, catch user mistakes in SQL code, and perform data lineage analysis&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/11\/30\/enabling-static-analysis-of-sql-queries-at-meta\/\">Continue reading <span class=\"screen-reader-text\">Enabling static analysis of SQL queries at Meta<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-658","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":566,"url":"https:\/\/fde.cat\/index.php\/2022\/04\/26\/sql-notebooks-combining-the-power-of-jupyter-and-sql-editors-for-data-analytics\/","url_meta":{"origin":658,"position":0},"title":"SQL Notebooks: Combining the power of Jupyter and SQL editors for data 
analytics","date":"April 26, 2022","format":false,"excerpt":"At Meta, our internal data tools are the main channel from our data scientists to our production engineers. As such, it\u2019s important for us to empower our scientists and engineers not only to use data to make decisions, but also to do so in a secure and compliant way. We\u2019ve\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":753,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/29\/scheduling-jupyter-notebooks-at-meta\/","url_meta":{"origin":658,"position":1},"title":"Scheduling Jupyter Notebooks at Meta","date":"August 29, 2023","format":false,"excerpt":"At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users. Notebooks are also being used widely for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. Users with such notebooks would have to remember to manually\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":636,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/27\/a-peek-at-datoramas-aws-s3-sql-query-tool\/","url_meta":{"origin":658,"position":2},"title":"A Peek at Datorama\u2019s AWS S3 SQL Query Tool","date":"September 27, 2022","format":false,"excerpt":"Datorama Reports for Marketing Cloud enables you to generate, view, and share a detailed analysis of your Email, Push, and Journey campaign-level data. 
For that, Datorama extracts large volumes of data to its Data Lake solution for the Marketing Cloud marketing analytics, which is stored in a structured table compatible\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":480,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/29\/open-sourcing-mariana-trench-analyzing-android-and-java-app-security-in-depth\/","url_meta":{"origin":658,"position":3},"title":"Open-sourcing Mariana Trench: Analyzing Android and Java app security in depth","date":"September 29, 2021","format":false,"excerpt":"We\u2019re sharing details about Mariana Trench (MT), a tool we use to spot and prevent security and privacy bugs in Android and Java applications. As part of our effort to help scale security through building automation, we recently open-sourced MT to support security engineers at Facebook and across the industry.\u00a0\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":875,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/10\/serverless-jupyter-notebooks-at-meta\/","url_meta":{"origin":658,"position":4},"title":"Serverless Jupyter Notebooks at Meta","date":"June 10, 2024","format":false,"excerpt":"At Meta, Bento, our internal Jupyter notebooks platform, is a popular tool that allows our engineers to mix code, text, and multimedia in a single document. 
Use cases run the entire spectrum from what we call \u201clite\u201d workloads that involve simple prototyping to heavier and more complex machine learning workflows.\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":181,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/10\/coral-a-sql-translation-analysis-and-rewrite-engine-for-modern-data-lakehouses\/","url_meta":{"origin":658,"position":5},"title":"Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses","date":"December 10, 2020","format":false,"excerpt":"Co-authors: Walaa Eldin Moustafa, Wenye Zhang, Sushant Raikar, Raymond Lam, Ron Hu, Shardul Mahadik, Laura Chen, Khai Tran, Chris Chen, and Nagarathnam Muthusamy Introduction At LinkedIn, our big data compute infrastructure continually grows over time, not only to keep pace with the growth in the number of data applications, or\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/658","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=658"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/658\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=658"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=658"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=658"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}