Budget-split testing: A trustworthy and powerful approach to marketplace A/B testing

Co-authors: Min Liu, Vangelis Dimopoulos, Elise Georis, Jialiang Mao, Di Luo, and Kang Kang

The LinkedIn ecosystem drives member and customer value through a series of marketplaces (e.g., the ads marketplace, the talent marketplace, etc.). We maximize that value by making data-informed product decisions via A/B testing. Traditional A/B tests on our marketplaces, however, are often statistically biased and under-powered. To mitigate this, we developed “budget-split” testing, which provides more trustworthy and powerful marketplace A/B testing. Read on to learn about the problem, solution, and successful results, using the ads marketplace as a running example. For more technical details, please refer to the paper “Trustworthy Online Marketplace Experimentation with Budget-split Design.”

Problems with marketplace A/B testing

To add some important context, modern online ad marketplaces use auction-based models for ad assignment. Advertisers set an objective, an audience, a campaign budget, and a bidding strategy for each ad campaign. Each “result” (member click, view, etc., depending on the objective) consumes a portion of the overall campaign budget, and the campaign keeps running for a set duration, until it ends or its budget is exhausted. The maximum revenue generated by a campaign cannot exceed its set budget.

When running A/B tests on the ads marketplace, we noticed two types of problems:

When testing a new ad feature, we’d often see a strong metric impact in our experiment, but wouldn’t observe the same level of impact when launched to the entire marketplace.
Many tests required an unacceptably long time to achieve statistically significant results.

The first problem exemplified cannibalization bias, while the second stemmed from insufficient statistical power.

Cannibalization bias
We can illustrate cannibalization bias with a hypothetical example (note: real-world manifestations of this bias are less extreme than this hypothetical). Suppose that we want to test how a new ad feature (e.g., improving the match between ads and members) impacts ad impressions and revenue. Prior to our experiment, let’s say all ad campaigns were spending 100% of their budgets (i.e., no new feature can increase ads revenue further). If we test our new feature in a traditional A/B test and observe an increase in the number of ad impressions, the test would also show a corresponding increase in revenue for the treatment group. Once we launch the feature to the entire marketplace, however, we won’t see that same increase in revenue, because all campaigns were already spending 100% of their budgets. So why did our A/B test lead us to the wrong conclusion?

This happens because the treatment and control groups compete for the same budget. In this example, budget shifts to treatment because it’s performing better. So the revenue “increase” that we observe in treatment simply reflects budget shifting between the groups, rather than a higher level of realized ads revenue.
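The budget shift can be made concrete with a toy calculation. This is only a sketch of the hypothetical above, not a model of a real auction: the function name shared_budget_spend, the "demand" numbers, the 20% lift, and the assumption that each member group captures budget in proportion to its demand are all illustrative.

```python
def shared_budget_spend(budget, demand_control, demand_treatment):
    """Spend of one campaign across two member groups drawing on ONE budget.

    Toy assumption: total spend is capped by the budget, and each group
    captures spend in proportion to its demand.
    """
    total_demand = demand_control + demand_treatment
    total_spend = min(budget, total_demand)
    share_treatment = demand_treatment / total_demand
    return total_spend * (1 - share_treatment), total_spend * share_treatment


budget = 100.0
base_demand = 60.0  # hypothetical: each half of the members could absorb 60 of spend
lift = 0.20         # hypothetical: the feature raises demand among treated members by 20%

# Traditional A/B test: both member groups compete for the same campaign budget.
spend_control, spend_treatment = shared_budget_spend(
    budget, base_demand, base_demand * (1 + lift)
)
measured_lift = spend_treatment / spend_control - 1

# Full launch: compare "everyone treated" with "nobody treated".
revenue_before = min(budget, 2 * base_demand)
revenue_after = min(budget, 2 * base_demand * (1 + lift))
true_lift = revenue_after / revenue_before - 1

print(f"measured lift in the A/B test: {measured_lift:.1%}")
print(f"true lift after full launch:   {true_lift:.1%}")
```

In this toy setup the test reports a revenue lift even though total spend is pinned at the budget both before and after launch, which is the cannibalization effect described above.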

Insufficient power
Beyond cannibalization bias, marketplace A/B tests that are randomized on small populations (e.g., advertisers) can suffer from low statistical power. As a result, each test takes longer to reach a conclusion, testing velocity drops, and experimentation becomes a bottleneck for product development.
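To see why population size matters, here is a minimal sketch of the textbook minimum detectable effect (MDE) for a two-sided, two-sample z-test with equal-sized arms. The sample sizes and the unit standard deviation are hypothetical, and assuming the same per-unit variance for both populations is a simplification made purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, sigma, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable at the given alpha and power.

    Standard two-sample z-test approximation with equal arm sizes and a
    common standard deviation sigma.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_arm)


# Hypothetical population sizes: a small unit (e.g., advertisers) vs. a large one.
mde_small = minimum_detectable_effect(n_per_arm=1_000, sigma=1.0)
mde_large = minimum_detectable_effect(n_per_arm=400_000, sigma=1.0)

print(f"MDE with 1,000 units per arm:   {mde_small:.4f}")
print(f"MDE with 400,000 units per arm: {mde_large:.4f}")
```

Because the MDE shrinks with the square root of the sample size, a test randomized on a small population needs either a much larger effect or a much longer run to reach significance.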

Solution: Budget-split testing

We designed budget-split testing to address both cannibalization bias and insufficient power.

First, we randomly split members into two equal-sized groups, with one group assigned to the treatment and the other to control. Then, we split the budget of each ad campaign into two identical “sub-campaigns,” with each sub-campaign getting half of the original campaign’s budget. Finally, we assign one of these sub-campaigns to the treatment member group and the other to the control member group.

The two sub-campaigns act independently on their assigned members, so they can’t compete for budget between treatment and control members. This functionally creates two identical marketplaces, where one has its members and sub-campaigns completely exposed to the treatment, while the other is completely exposed to the control. Directly comparing these two marketplaces measures impact without cannibalization bias. Furthermore, these tests run with a large member population (versus a relatively smaller advertiser population), which improves experiment power.
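A minimal sketch of the design, revisiting the hypothetical in which campaigns already exhaust their budgets. The SubCampaign type, the helper names, and the demand and lift numbers are illustrative assumptions, not part of the production system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCampaign:
    campaign_id: str
    arm: str      # "control" or "treatment"
    budget: float


def split_campaign(campaign_id, budget):
    """Split one campaign into two sub-campaigns, each with half the budget."""
    half = budget / 2
    return {arm: SubCampaign(campaign_id, arm, half) for arm in ("control", "treatment")}


def isolated_spend(sub_campaign, demand):
    """A sub-campaign can only spend its own budget, whatever the other arm does."""
    return min(sub_campaign.budget, demand)


subs = split_campaign("campaign-1", budget=100.0)
base_demand = 60.0  # hypothetical: each member group could absorb 60 of spend
lift = 0.20         # hypothetical: the feature raises demand among treated members by 20%

spend_control = isolated_spend(subs["control"], base_demand)
spend_treatment = isolated_spend(subs["treatment"], base_demand * (1 + lift))
measured_lift = spend_treatment / spend_control - 1

print(f"control spend:   {spend_control:.1f}")
print(f"treatment spend: {spend_treatment:.1f}")
print(f"measured lift:   {measured_lift:.1%}")
```

Since neither sub-campaign can draw on the other's half of the budget, the measured revenue lift in this toy case is zero, matching what a full launch would show when budgets are already exhausted.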

Implementation

We built budget-split testing with the following principles:

The system must handle common test changes (e.g., turning tests on/off, re-randomization, etc.).
The system must perfectly separate budget between the two sub-campaigns.
The results must be easy to understand and must incorporate our existing business metrics.

The ads delivery system contains two main parts. The first is an ad server that responds to ad requests and controls responses via a bidding/pacing module. The second is a tracking/billing service that tracks ad impressions, clicks, and costs (which are tallied at the campaign level).

We enabled budget-split testing as follows:

In the request handler tier of the ad server, we randomized all requests from a member into either treatment or control, depending on the member ID.
In the bidding/pacing module, we replaced the campaign-level controls with sub-campaign level counterparts.
In the tracking service, we started tracking ad impressions and clicks at the sub-campaign level.
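The three changes above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the actual ad server code: hashing a salted member ID is one common way to get a stable assignment, and assign_arm, SubCampaignLedger, handle_request, the salt string, and the flat per-request cost are all hypothetical.

```python
import hashlib
from collections import defaultdict

ARMS = ("control", "treatment")


def assign_arm(member_id, salt="budget-split-demo"):
    """Deterministically map a member to an arm; a new salt re-randomizes."""
    digest = hashlib.sha256(f"{salt}:{member_id}".encode("utf-8")).digest()
    return ARMS[digest[0] % 2]


class SubCampaignLedger:
    """Tracks budget and spend per (campaign, arm) instead of per campaign."""

    def __init__(self, campaign_budgets):
        self.budget = {
            (campaign_id, arm): total / 2
            for campaign_id, total in campaign_budgets.items()
            for arm in ARMS
        }
        self.spend = defaultdict(float)

    def charge(self, campaign_id, arm, cost):
        """Record the cost if the sub-campaign can afford it; otherwise refuse."""
        key = (campaign_id, arm)
        if self.spend[key] + cost > self.budget[key]:
            return False
        self.spend[key] += cost
        return True


def handle_request(ledger, member_id, campaign_id, cost):
    """Request handler: pick the arm from the member ID, then bill that arm only."""
    arm = assign_arm(member_id)
    return arm, ledger.charge(campaign_id, arm, cost)


ledger = SubCampaignLedger({"campaign-1": 10.0})
for member_id in range(100):
    handle_request(ledger, member_id, "campaign-1", cost=1.0)

for arm in ARMS:
    print(arm, ledger.spend[("campaign-1", arm)])
```

Keying both the budget and the spend by (campaign, arm) is what keeps the two halves of the budget separate: a request from a treatment member can never be billed against the control sub-campaign.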

Results

Mitigating bias
We compared the results from a series of budget-split tests with the results from traditional member-randomized A/B tests that were set up in a nearly identical way, and observed a 30-70% difference in measured impact between the two methodologies. In each case, the budget-split test measured a smaller impact than its traditional A/B test counterpart. This supports our initial hypothesis that traditional A/B tests are far less reliable than unbiased budget-split tests.

Improving power
We also compared the power of budget-split tests with the power of both traditional A/B tests (measured by campaigns) and “alternating-day” tests (a common marketplace testing workaround). Budget-split testing improved test sensitivity by up to 10X. Tests that used to require several weeks now only take 1-3 days.

Conclusion

Budget-split testing has mitigated cannibalization bias and increased statistical power in our marketplace testing. This has since unblocked product launches with double-digit impact on member value (e.g., more relevant ads in the feed, a more engaging job seeker experience, etc.) and customer value (e.g., better return on ad spend, higher ROI for job posters, etc.). We hope that readers can derive similar value by applying our learnings to their own marketplaces.

Acknowledgments

We would like to thank Wei Wei and Ishan Gupta for implementing budget-split testing in the Ads Marketplace, and Qing Duan, Jerry Shen, Linda Fayad, Giorgio Martini, and other team members from the LinkedIn Jobs Marketplace AI team for the design and implementation of budget-split testing for the Jobs Marketplace. We also thank Ya Xu, Weitao Duan, Parvez Ahammad, Anuj Malhotra, Le Li, Onkar Dalal, Shahriar Shariat, Yi Zhang, Kaiyu Yang, Xingyao Ye, Yang Zhao, Jerry Shen, Steve Na, Mahesh Gupta, Kirill Lebedev, Mindaou Gu, and Sumedha Swamy for their continued support and helpful discussions, and Stephen Lynch, Heyun Jeong, and Hannah Sills for their reviews. Finally, thank you to Stacie Vu for the graphics used in the first half of this post.