{"id":180,"date":"2020-12-16T19:00:00","date_gmt":"2020-12-16T19:00:00","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2020\/12\/16\/a-b-testing-at-linkedin-assigning-variants-at-scale\/"},"modified":"2021-02-02T13:47:12","modified_gmt":"2021-02-02T13:47:12","slug":"a-b-testing-at-linkedin-assigning-variants-at-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2020\/12\/16\/a-b-testing-at-linkedin-assigning-variants-at-scale\/","title":{"rendered":"A\/B testing at LinkedIn: Assigning variants at scale"},"content":{"rendered":"<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Co-authors: <a href=\"https:\/\/www.linkedin.com\/in\/alexanderivaniuk\" target=\"_blank\" rel=\"noopener\">Alexander Ivaniuk<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/weitaoduan\" target=\"_blank\" rel=\"noopener\">Weitao Duan<\/a><\/i><\/p>\n<p><i>Editor\u2019s note: This blog post is the second in a series providing an overview and history of LinkedIn\u2019s experimentation platform. The previous post on the history of LinkedIn\u2019s experimentation infrastructure can be found <a href=\"https:\/\/engineering.linkedin.com\/blog\/2020\/our-evolution-towards-t-rex--the-prehistory-of-experimentation-i\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/i><\/p>\n<h2>Introducing variant assignment<\/h2>\n<p>Previously on the blog, we\u2019ve shared a look into how experimentation works at LinkedIn with an overview of our experimentation platform, \u201cT-REX,\u201d and a deep dive into the architecture of the <a href=\"https:\/\/engineering.linkedin.com\/blog\/2020\/making-the-linkedin-experimentation-engine-20x-faster\" target=\"_blank\" rel=\"noopener\">Lix engine<\/a>. 
However, no matter how well-designed an infrastructure is, the ability to scale depends on an efficient and scientifically valid variant assignment method. In this post, we will discuss a practical approach to making variant assignment blazingly fast to handle trillions of invocations per day, as well as the invariants and tricky parts of using such methods for online experiments.<\/p>\n<p>The validity of A\/B testing is rooted in the assumption that members are assigned to variants at random. That\u2019s why it\u2019s critical to find a good randomization algorithm. There are three main characteristics of a good variant assignment function:<\/p>\n<ol>\n<li>Assignment of variants to members happens according to the desired split (i.e., no sample size ratio mismatch)<\/li>\n<li>Variant assignment of a single member is deterministic (i.e., repeatable)<\/li>\n<li>No correlation between the assignments when there are multiple experiments running<\/li>\n<\/ol>\n<p>In particular, the third characteristic is critical because it means that a member\u2019s assignment in one experiment has no effect on the probability of being assigned to a variant in other experiments. This is important because we want to analyze each experiment separately. We will talk more about this characteristic when we go through an example below.<\/p>\n<h2>General approach<\/h2>\n<p>Suppose we have a test <i>T<\/i>, its population <i>P<\/i>, and variants <i>V<sub>0<\/sub><\/i>, <i>V<sub>1<\/sub><\/i>, <i>V<sub>2<\/sub><\/i>, \u2026, <i>V<sub>N-1<\/sub><\/i>, where each of the variants is assigned a fraction of <i>P<\/i> equal to <i>A<sub>0<\/sub><\/i>, <i>A<sub>1<\/sub><\/i>, \u2026, <i>A<sub>N-1<\/sub><\/i>, and <i>A<sub>0<\/sub><\/i> + <i>A<sub>1<\/sub><\/i> + \u2026 + <i>A<sub>N-1<\/sub><\/i> = 1. 
Let\u2019s assume that <i>P<\/i> is a member population\u00a0and that all members are assigned integer identifiers.<\/p>\n<p>Let\u2019s represent the variants in a one-dimensional coordinate system on a line segment [0, 1). All the variants\u2019 subpopulations can be represented as subsegments within the [0, 1) segment: <i>V<sub>0<\/sub><\/i>\u2019s population will correspond to [0, <i>A<sub>0<\/sub><\/i>), <i>V<sub>1<\/sub><\/i>\u2019s population to [<i>A<sub>0<\/sub><\/i>, <i>A<sub>0<\/sub><\/i> + <i>A<sub>1<\/sub><\/i>), \u2026, <i>V<sub>N-1<\/sub><\/i>\u2019s population to [<i>A<sub>0<\/sub><\/i> + <i>A<sub>1<\/sub><\/i> + \u2026 + <i>A<sub>N-2<\/sub><\/i>, 1). The problem of assigning a variant to a member is now reduced to selecting a point on the line segment [0, 1).\u00a0<\/p>\n<p>In order to assign a variant, the following algorithm can be used:<\/p>\n<ol>\n<li>Project the entire test\u2019s population onto [0, 1)<\/li>\n<li>Shuffle it in a pseudo-random way along the segment<\/li>\n<li>Find the variant subsegment that owns the shuffled projection point of a member and return the corresponding variant<\/li>\n<\/ol>\n<p>In a more concrete example, let\u2019s assume the following:<\/p>\n<ol>\n<li>Population <i>P<\/i> is 100,000 members large.<\/li>\n<li>The experiment of interest has 3 variants, <i>A<\/i>, <i>B<\/i>, <i>C<\/i>, that are assigned to 20%, 40%, and 40% of <i>P<\/i>, respectively.\u00a0<\/li>\n<li>We have members with the IDs shown in the picture below, on which to perform variant assignment.<\/li>\n<\/ol>\n<p>For the purpose of this explanation, we are interested in how member ID #8000 gets his or her treatment:<\/p>\n<ol>\n<li>Member ID #8000 is mapped to the point 0.08 as a result of computing <span class=\"monospace\">&lt;MEMBER_ID&gt; \/ &lt;NUM_MEMBERS&gt;<\/span><\/li>\n<li>Then, we apply a pseudo-random transformation to point 0.08 and project it into point 0.848.<\/li>\n<li>We check the variant allocation axis and see that members with IDs mapped 
to points with coordinates between 0.6 and 1 get variant <i>C<\/i>, so member ID #8000 is also assigned variant <i>C<\/i>.<\/li>\n<\/ol><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1701904319\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-1.png?resize=750%2C684&#038;ssl=1\" alt=\"graph-showing-variant-mapping-for-member-population\" height=\"684\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_831864065\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Projection, shuffling, and final variant mapping for the member population<\/i><\/p>\n<h2>Assignment methods<\/h2>\n<p>There are two different ways of performing variant assignment in practice. We will describe these methods based on the design described in our previous blog post on <a href=\"https:\/\/engineering.linkedin.com\/blog\/2020\/our-evolution-towards-t-rex--the-prehistory-of-experimentation-i\" target=\"_blank\" rel=\"noopener\">T-Rex<\/a>, which enables an experiment to target a subpopulation of Linkedin.com based on member attributes.\u00a0<\/p>\n<p><b>Use a random number generator (RNG). <\/b>The objective here is to randomly select a point in [0, 1) and then assign a corresponding variant to a member. 
Then the (test, variant, member) tuple is stored and retrieved from the store on all subsequent evaluations.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_221506826\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-2.png?resize=750%2C503&#038;ssl=1\" alt=\"graph-showing-random-number-generator-based-variant-assignment\" height=\"503\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_197603293\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>RNG-based variant assignment<\/i><\/p>\n<p><b>Use hashing.<\/b> The variant can also be retrieved via hashing. 
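As a minimal sketch (illustrative only, not LinkedIn\u2019s production code: the test ID is a made-up salt, the 20\/40\/40 splits echo the example above, and MD5 stands in for the generic digest function described below):

```python
import hashlib

def assign_variant(test_id, member_id, splits):
    """Deterministically map (test_id, member_id) to a point in [0, 1)
    and return the variant whose subsegment owns that point."""
    # Salt the hash with the test ID so different tests shuffle independently.
    digest = hashlib.md5(f"{test_id}:{member_id}".encode()).digest()
    # Interpret the first 8 bytes as an integer and scale into [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for variant, fraction in splits:
        cumulative += fraction
        if point < cumulative:
            return variant
    return splits[-1][0]  # guard against floating-point rounding

# 20/40/40 split, mirroring the A/B/C example above.
splits = [("A", 0.2), ("B", 0.4), ("C", 0.4)]
# Deterministic: re-evaluating the same member yields the same variant.
assert assign_variant("feed-relevance", 8000, splits) == assign_variant("feed-relevance", 8000, splits)
```

Because the salt differs per test, the shuffles of the member space are effectively independent across tests, which is the third characteristic above.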
Such a method requires defining a family of hash functions <span class=\"monospace\">HASH(salt): member_id -&gt; [0, 1)<\/span> where:<\/p>\n<\/p>\n<ul>\n<li>each of the functions maps the space of member IDs uniformly into [0, 1),<\/li>\n<li>\u201c<span class=\"monospace\">salt<\/span>\u201d is a constant, unique value chosen for each A\/B test (for instance, its <span class=\"monospace\">test_id<\/span>),<\/li>\n<li>any two distinct functions from the family should generate independent distributions of the projection points.<\/li>\n<\/ul>\n<p>Such a family of hash functions could be created from a common crypto digest function <span class=\"monospace\">FCrypt: byte_array -&gt; [0, F_MAX]<\/span>, where <span class=\"monospace\">F_MAX<\/span> is the largest value returned by the function and <span class=\"monospace\">[0, F_MAX]<\/span> is an integer range. This entails the following steps:<\/p>\n<ol>\n<li>Remap the function into <span class=\"monospace\">[0, 1)<\/span> by applying <span class=\"monospace\">HASH(bytes) = FCrypt(bytes) \/ (F_MAX + 1)<\/span>.<\/li>\n<li>Encode \u201c<span class=\"monospace\">salt<\/span>\u201d as a fixed-width byte prefix (e.g., a 4-byte prefix to make sure that it will be able to fit all possible values of salt).<\/li>\n<li>Encode <span class=\"monospace\">member_id<\/span> as a variable-size byte array.<\/li>\n<li>Then <span class=\"monospace\">HASH(salt)(member_id) = FCrypt(concat(prefix(salt, 4), bytes(member_id)))<\/span>.<\/li>\n<\/ol><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1942163071\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-3.png?resize=750%2C304&#038;ssl=1\" 
alt=\"graph-showing-hash-based-variant-assignment\" height=\"304\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1486059991\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Hash-based variant assignment\u00a0<\/i><\/p>\n<h2>Choosing the best method<\/h2>\n<p>The hash-based variant assignment approach is one of the cornerstones of LinkedIn\u2019s experimentation system. Consider this: the RNG approach requires experiment operators to execute a remote call to a cache or an \u201conline\u201d data store every time a test variant is evaluated. Such an online variant assignment store can evolve into a critical dependency and bottleneck for the entire experimentation platform, which means that availability, consistency, and speed of variant assignment will be limited by the parameters of the data store. Taking into consideration the <a href=\"https:\/\/en.wikipedia.org\/wiki\/CAP_theorem\" target=\"_blank\" rel=\"noopener\">CAP theorem<\/a>, the experimentation platform would not be able to achieve repeatability of the variant assignment\u2019s result, which requires strong consistency, high availability, and network partition tolerance.<\/p>\n<p>Let\u2019s look at a possible topology of such a solution below. 
Such a system may experience problems in one of the following scenarios:<\/p>\n<ul>\n<li>When the assignment service, the cache, or the store is slow or down<\/li>\n<li>When there is a network partition event between data centers, or the link between them is slow, thus either hurting the availability of the system or impacting the replayability of the assignments<\/li>\n<li>When there is a network partition event inside the data center<\/li>\n<\/ul><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1211792475\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-4.png?resize=750%2C561&#038;ssl=1\" alt=\"infrastructure-topology-for-the-r-n-g-assignment\" height=\"561\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_397957196\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Infrastructure topology for the RNG assignment method<\/i><\/p>\n<p>On the other hand, the hash-based assignment approach is deterministic and enables us to use an advanced form of caching, with about 99.98% of requests being evaluated in application instances and only a mere 0.02% of evaluations resulting in network requests to the experimentation backend, using the following topology:<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_2060710798\"><\/a>\n <\/div>\n<ul class=\"resource-image-block 
">
single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-5.png?resize=750%2C282&#038;ssl=1\" alt=\"infrastructure-topology-for-the-hash-based-assignment\" height=\"282\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1576651815\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>This high locality of evaluations is made possible by caching the experiment definitions within each service running A\/B tests and by processing variant evaluation rules in the experimentation engine, which is a part of the A\/B client library:<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1809339493\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-6.png?resize=741%2C646&#038;ssl=1\" alt=\"graph-showing-how-the-infrastructure-from-external-request-to-experimentation-backend\" height=\"646\" width=\"741\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_2038236326\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>Referencing back to the CAP theorem, our <a 
href=\"https:\/\/engineering.linkedin.com\/blog\/2020\/our-evolution-towards-t-rex--the-prehistory-of-experimentation-i\" target=\"_blank\" rel=\"noopener\">infrastructure<\/a> is able to tolerate downtime of the experimentation backend, network partition events between or inside data centers, and any slowness of the experimentation backend.<\/p>\n<p>At our scale of up to 35 trillion evaluations and 41,000 A\/B tests, the difference between the assignment methods is substantial:<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceTable section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourcetable\"><\/a>\n <\/div>\n<div class=\"resource-table-section header-row\">\n<table>\n<tbody>\n<tr>\n<td><b>Metric\/Approach<\/b><\/td>\n<td><b>RNG (Estimated)<\/b><\/td>\n<td><b>Hash-based<\/b><\/td>\n<\/tr>\n<tr>\n<td>Assignment storage write QPS<\/td>\n<td>1,770,000 &#8211; 25,000,000<\/td>\n<td>50,000 Kafka write QPS for select assignment data<\/td>\n<\/tr>\n<tr>\n<td>Assignment storage disk read QPS<\/td>\n<td>26,000,000<\/td>\n<td>Up to 80,000<\/td>\n<\/tr>\n<tr>\n<td>Cache read Network QPS<\/td>\n<td>240,000,000-260,000,000<\/td>\n<td>Up to 1,000,000<\/td>\n<\/tr>\n<tr>\n<td>Online assignment data storage size<\/td>\n<td>26-390 TB\/day for assignment data<\/td>\n<td>200 GB for member attributes data, total\u00a0<\/td>\n<\/tr>\n<tr>\n<td>Offline data storage<\/td>\n<td>26-390 TB\/day for assignment data<\/td>\n<td>200 GB\/day for select assignment data and 200 GB\/day for member attributes<\/td>\n<\/tr>\n<tr>\n<td>Complexity of evaluations<\/td>\n<td>O(num_tests * num_members)<\/td>\n<td>O(num_members)<\/p>\n<p> Local evaluation of all tests for a member is significantly faster than a remote call and can be treated as a constant<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" 
name=\"post_par_resourceparagraph_617464606\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>In comparison, the RNG method would have required us to:<\/p>\n<ul>\n<li>Handle at least 240x QPS with a distributed cache than what we handle now in the experimentation backend.<\/li>\n<li>Have at least 3900x capacity in the online storage system for variant assignment, assuming a 30-day retention period for the RNG method\u2019s data.<\/li>\n<li>Handle a much higher disk read\/write pressure.<\/li>\n<\/ul>\n<h2>Recording or replaying variant assignment<\/h2>\n<p>In some cases, we continue to record variant assignment data. Services with A\/B tests send data to Kafka that then gets transferred to Hadoop (depicted in the diagram below). This is done to carry attributions of members\u2019 actions to certain experiments and variants during batch A\/B report generation, which is considerably less sensitive to both the latency of data synchronization and the availability of the assignment data store.<\/p>\n<p>In some cases, certain tests may result in trillions of variant assignments per day, so we offer a way for these respective test owners to optimize their Kafka resource footprint: a test owner may tell the experimentation platform to assume that the experiments\u2019 population comprises either all active members of the site or all the registered members. 
In that case, the experimentation client will not send experiment assignment data and the platform will approximate the variant assignment on Hadoop.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1774165896\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-7.png?resize=750%2C563&#038;ssl=1\" alt=\"graph-showing-the-assignment-data-flow\" height=\"563\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_453045578\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>The assignment data flow: storage, ETL, and offline approximation<\/i><\/p>\n<h2>Independence of variant assignment<\/h2>\n<p>When there are multiple experiments running, we want to be able to analyze each experiment separately. This can only be done based on the assumption that the assignments in all the experiments are independent. Imagine we have two concurrent experiments: Experiment 1 tests two relevance models for LinkedIn feed and Experiment 2 tests two font sizes. Let <i>Z<sub>1<\/sub><\/i> and <i>Z<sub>2<\/sub><\/i> denote the assignment of a member in Experiment 1 and 2 respectively. 
In mathematical formulation, the third characteristic can be written as:<\/p>\n<p><i>Prob(Z<sub>1<\/sub> = z<sub>1<\/sub>) = Prob(Z<sub>1<\/sub> = z<sub>1<\/sub> | Z<sub>2<\/sub> = z<sub>2<\/sub>)<\/i>, where <i>z<sub>1<\/sub><\/i> and <i>z<sub>2<\/sub><\/i> are the possible variants in Experiment 1 and 2.<\/p>\n<p>We call Experiment 1 and 2 orthogonal because the assignments are independent. If Experiment 1 and 2 were both set up with a 50-50 split between the two variants, members assigned to Model 1 have an equal chance of getting the two font sizes, and vice versa. Assume the baseline (Model 2 and Font Size 2) has 8 page views per member, Model 1 increases the average page views per member by 5 compared to Model 2, and Font Size 1 increases it by 10 compared to Font Size 2. Let\u2019s also assume there is no interaction between the two tests. If we analyze Experiment 1 separately, the effect of Model 1 vs. Model 2 is estimated below:<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_55375373\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-formula.png?resize=550%2C119&#038;ssl=1\" alt=\"formula-comparing-estimated-effects-of-model-one-versus-model-two\" height=\"119\" width=\"550\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_2132708712\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>In other words, we can ignore that Experiment 2 is actually running and proceed to analyze Experiment 
1 only.<\/p>\n<p>At LinkedIn, we apply the MD5 hash function to the combination of member ID and experiment identifier. The MD5 hash function has also been validated with Chi-squared tests to confirm that it satisfies the independent assignment characteristic.<\/p>\n<h2>Interactions between the variants<\/h2>\n<p>In the previous example, we assumed there was no interaction without properly defining the term. A statistical interaction between two variants <i>A<\/i> and <i>B<\/i> exists if their combined effect is not the same as the sum of the two individual treatment effects. In the previous example, the combined effect of Model 1 and Font Size 1 could be 17 instead of the sum of 15, because the combination may work better than applying the individual variants. There are also examples of negative or harmful interactions. For example, let\u2019s say we have two tests, one of which controls the font color (blue vs. black), while the other controls the background color (blue vs. white). While blue font color may work better than black on the white background, it renders the text illegible on the blue background. We should also note that it is impossible to completely prevent interaction when operating a large-scale experimentation system. Teams working on different components of a larger product may be unaware of features tested by other teams and may not check for interactions.<\/p>\n<p>To detect interactions, we provide a tool that can check for pairwise interactions between any two experiments. Three-way or higher-order interactions, although possible, are extremely rare. Our empirical experience suggests that even pairwise interactions are quite rare, especially between experiments from two distinct product pillars (e.g., messaging and advertising). 
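As a toy illustration of this definition (reusing the hypothetical Model\/Font page-view numbers from the earlier example; this is not the actual detection tool):

```python
# Toy pairwise-interaction estimate (illustrative, not LinkedIn's tool).
# Hypothetical mean page views per member from the example above.
baseline = 8.0        # Model 2 + Font Size 2
model_effect = 5.0    # lift of Model 1 over Model 2
font_effect = 10.0    # lift of Font Size 1 over Font Size 2
combined = 25.0       # observed mean for Model 1 + Font Size 1

# Interaction = combined effect minus the sum of the individual effects;
# zero means the effects are purely additive.
interaction = (combined - baseline) - (model_effect + font_effect)
print(interaction)  # 2.0, a positive interaction
```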
When interactions do show up, they are often smaller in magnitude than the main effects.<\/p>\n<h2>Sample size ratio mismatches<\/h2>\n<p>Sample size ratio mismatch (SSRM) happens when the observed sample size ratio between two variants does not follow the expected ratio. For example, we may expect a 50-50 split, but observe 100k members in the treatment group and 110k members in the control group. This usually indicates some underlying issue with the experiment setup. The reports from such experiments become untrustworthy, and any decisions based on them must be treated as potentially misleading.<\/p>\n<p>At LinkedIn, we leverage the Chi-squared goodness-of-fit test to monitor for ratio mismatch. For simplicity, we assume there are only two variants in the experiment, <i>n<sub>t<\/sub><\/i> and <i>n<sub>c<\/sub><\/i> are the observed sample sizes for the treatment and control group, and the expected treatment ramp ratio is <i>r<sub>t<\/sub><\/i>. The expected sample sizes for the treatment and control group are <i>E<sub>t<\/sub> = (n<sub>t<\/sub> + n<sub>c<\/sub>) r<sub>t<\/sub><\/i> and <i>E<sub>c<\/sub> = (n<sub>t<\/sub> + n<sub>c<\/sub>)(1 &#8211; r<sub>t<\/sub>)<\/i> respectively. 
Under the null hypothesis that the sample sizes follow the expected split, the Chi-squared statistic<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_529103883\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-formula2.png?resize=250%2C79&#038;ssl=1\" alt=\"chi-squared-statistics-for-E-t-and-E-c-respectively\" height=\"79\" width=\"250\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_661179892\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>has a \u03c7<sup>2<\/sup><sub>1<\/sub>\u00a0distribution. When the observed Chi-squared statistic is too extreme, we conclude that the observed sample sizes do not follow the expected ratio. 
Note that when SSRM is detected, it does not tell us where the problem is\u2014further investigation is needed to understand the root cause.<\/p>\n<p>At LinkedIn, when SSRM is detected for an experiment, we hide the report by default and show a highly visible warning on our UI.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-8.png?resize=450%2C20&#038;ssl=1\" alt=\"\" height=\"20\" width=\"450\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_221153473\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2020\/12\/variant-9.png?resize=600%2C270&#038;ssl=1\" alt=\"\" height=\"270\" width=\"600\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1554154579\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Sample notifications when SSRM is detected<\/i><\/p>\n<p>Without knowing the underlying cause, SSRM can be difficult to adjust because A\/B tests operate on the assumption that members are assigned to variants randomly according to the split. 
This makes it critical to not only diagnose the root cause of SSRM, but also apply the corresponding fix. Here we summarize a few typical causes of SSRM.<\/p>\n<ol>\n<li><b>Dynamic targeting: <\/b>Targeting refers to running experiments on a specific member set according to their properties and activities. While most targeting criteria are static for our members (such as country or industry), we also use more dynamic targeting criteria that can change over time. When we run experiments with dynamic targeting criteria, we need to consider whether the treatment itself would interact with the targeting criteria and hence result in members switching in or out of the targeting criteria at different rates in different variants. For example, let\u2019s imagine that a marketing team wants to send emails to inactive members and test two subject lines in an A\/B test. It turns out that one subject line performs better and brings more members back to the site. As the team continues to ramp up the winning subject line, the team observes SSRM. This is due to the fact that there were fewer dormant members in the winning variant because many had already converted to become active again. The correct analysis approach is to use pre-experiment member properties and fix the targeting cohort for the entire ramping process.<\/li>\n<li><b>Residual effect: <\/b>The residual effect usually refers to contamination of a subsequent experiment\u2019s member split by a former experiment. When the treatment effect is large, it may change the frequency at which members visit. As a result, the residual effect may lead to mismatched sample sizes. One such example was with our People You May Know relevance algorithm improvement. In the first ramp, the experiment showed promising results on engagement across the board. However, starting from the second ramp, sample size ratio mismatch started to occur. 
It turned out that this was purely due to the fact that the new algorithm was so good that it made members more active and brought them back more often. Such bias can be corrected with re-randomization and the inclusion of members from the very beginning of the experiment in the analysis. However, in our experience, it is rare for the treatment itself to be impactful enough to change the member re-trigger rate.<\/li>\n<li><b>Biased code implementation: <\/b>Biased code implementation can create symptoms similar to a residual effect. For example, when we re-designed the LinkedIn homepage, the variant evaluation code was implemented in two places: 1) when members hit the router by typing \u2018linkedin.com\u2019 and 2) when members entered the new homepage directly via designated URLs. These URLs could only be accessed by members in the new homepage group (treatment group). The second code call was needed so that if the experiment was terminated, members would lose access to the new homepage with the URLs. After running the experiment, we saw more members than expected in the treatment group after the first ramp; including members from the first ramp in the analysis would have resolved the bias. Upon further investigation, we discovered that some members entered the site through the designated URLs exclusively, as they had saved those URLs as bookmarks. Because only those in the treatment group would be evaluated in such cases, the treatment group had a higher member count than expected.<\/li>\n<\/ol>\n<p>Because diagnosing SSRM can be time-consuming for the operators of the experiment, we created an automatic tool to run a series of diagnoses and summarize the results and potential root cause. Teams at LinkedIn can receive SSRM diagnostic reports on our experimentation platform with just a few clicks. Because the tool is self-service, it eliminates the need for the platform team to be involved in every SSRM root cause analysis. 
This greatly speeds up the diagnostic process and helps unblock ongoing experiments.<\/p>\n<p>An experimentation platform is a complex system with many moving parts, and variant assignment at scale is the foundation of scientific experimentation. By sharing key considerations from how we at LinkedIn have built the mathematical principles of A\/B testing into a robust infrastructure, we hope to help others in their own experimentation journeys.<\/p>\n<h2>References<\/h2>\n<p>[1] Chen, N., Liu, M., &amp; Xu, Y. (2019, January). How A\/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (pp. 501-509).<\/p>\n<p>[2] Xu, Y., Chen, N., Fernandez, A., Sinno, O., &amp; Bhasin, A. (2015, August). From infrastructure to culture: A\/B testing challenges in large scale social networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2227-2236).<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/engineering.linkedin.com\/blog\/2020\/a-b-testing-variant-assignment\" target=\"_blank\" rel=\"noopener\">Read More<\/a><\/p>\n","protected":false,"excerpt":{"rendered":"<p>Co-authors: Alexander Ivaniuk and Weitao Duan Editor\u2019s note: This blog post is the second in a series providing an overview and history of LinkedIn\u2019s experimentation platform. The previous post on the history of LinkedIn\u2019s experimentation infrastructure can be found here.
Introducing variant assignment Previously on the blog, we\u2019ve shared a look into how experimentation works&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2020\/12\/16\/a-b-testing-at-linkedin-assigning-variants-at-scale\/\">Continue reading <span class=\"screen-reader-text\">A\/B testing at LinkedIn: Assigning variants at scale<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[1,7],"tags":[],"class_list":["post-180","post","type-post","status-publish","format-standard","hentry","category-external","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":877,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/11\/unlocking-the-power-of-mixed-reality-devices-with-mobileconfig\/","url_meta":{"origin":180,"position":0},"title":"Unlocking the power of mixed reality devices with MobileConfig","date":"June 11, 2024","format":false,"excerpt":"MobileConfig enables developers to centrally manage a mobile app\u2019s configuration parameters in our data centers. Once a parameter value is changed on our central server, billions of app devices automatically fetch and apply the new value without app updates. These remotely managed configuration parameters serve various purposes such as A\/B\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":817,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/29\/sre-weekly-issue-409\/","url_meta":{"origin":180,"position":1},"title":"SRE Weekly Issue #409","date":"January 29, 2024","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: It\u2019s time for a new world of alerting tools that prioritize engineer well-being and efficiency. 
The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue. https:\/\/firehydrant.com\/blog\/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call\/ Executing\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":175,"url":"https:\/\/fde.cat\/index.php\/2021\/01\/26\/creating-a-secure-and-trusted-jobs-ecosystem-on-linkedin\/","url_meta":{"origin":180,"position":2},"title":"Creating a secure and trusted Jobs ecosystem on LinkedIn","date":"January 26, 2021","format":false,"excerpt":"Co-authors: Sakshi Jain, Grace Tang, Gaurav Vashist, Yu Wang, John Lu, Ravish Chhabra, Shruti Sharma, Dana Tom, and Ranjeet Ranjan LinkedIn\u2019s vision is to connect every member of the global workforce to economic opportunity. A key driver towards this vision is our world-class hiring marketplace, where we help job seekers\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":786,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/13\/sre-weekly-issue-398\/","url_meta":{"origin":180,"position":3},"title":"SRE Weekly Issue #398","date":"November 13, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: \u201cChange is the essential process of all existence.\u201d \u2013 Spock It\u2019s time for alerting to evolve. Get a first look at how incident management platform FireHydrant is architecting Signals, its native alerting tool, for resilience in the Signals Captain\u2019s Log. 
https:\/\/firehydrant.com\/blog\/captains-log-a-first-look-at-our-architecture-for-signals\/\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":713,"url":"https:\/\/fde.cat\/index.php\/2023\/05\/15\/sre-weekly-issue-372\/","url_meta":{"origin":180,"position":4},"title":"SRE Weekly Issue #372","date":"May 15, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, Squarespace, accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":769,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/09\/sre-weekly-issue-393\/","url_meta":{"origin":180,"position":5},"title":"SRE Weekly Issue #393","date":"October 9, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. 
As a thank-you to our community,\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/180","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=180"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/180\/revisions"}],"predecessor-version":[{"id":211,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/180\/revisions\/211"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=180"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=180"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}