LEVERAGING A/B TESTING TO DRIVE BUSINESS DECISIONS
By Jude Ndu
We often come across the term A/B testing and wonder what it actually does, what role it plays in data science, and how it can be used to propel a business towards its goals. In this article, we will walk through the applications of A/B testing and how it can be implemented.
What is A/B testing? In simple terms, A/B testing means comparing two versions of something across two groups of users to find out which performs better. It is also known as split testing: an online experiment used to test potential improvements to a website or mobile application. In the case of a website, two versions are shown to different users, usually the existing website and a version with a potential change. The results are then analyzed to determine whether the change is an improvement and worth launching. One can test just about anything, including button size, font, color, sales ads, and calls to action, to name a few. Consider also a company that produces mobile game applications and wishes to upgrade its game to a newer version with additional features. In a business context, the producers should not force this change on their users without first getting feedback on how users react to the new version, otherwise it could lead to a loss of users, or player churn. As we know, this can be very damaging for a company that relies on such an application to generate revenue. Running an A/B test is therefore valuable here, since it helps uncover how users behave towards the new version.
Steps to Running an A/B Test
As a data scientist or product analyst, it is important to take enough time to plan your A/B test by first figuring out what you want to test, and in doing so, always place the business problem first. What do I mean by this? No company wants to spend its resources on a task that produces an insignificant result. This requires going through the elements of your product to work out which is most likely to boost sales or is most closely tied to the problem you are trying to solve. Once you are confident an element matters, that is the one to test.
Secondly, ensure that your product can generate enough data within a short time window. Consider a website that sells cars. Is it a good fit for an A/B test? A car isn't something people buy very often, so the site may not get enough traffic to generate sufficient data. A long-running experiment can also produce skewed data, and no one wants to spend their time on an experiment that is unlikely to give a reliable result. An A/B test on such a website is therefore usually not practical.
Thirdly, set up the hypothesis you will test, but before this, figure out the exact metric on which the hypothesis will be evaluated. Let us take Google as an example. Suppose they wish to optimize their web pages by running an A/B test. The goal could be to find out how often users click certain buttons on a page. Users can be guided down a funnel or path, either by helping them find exactly what they are looking for or by prompting them to click these buttons, which could be a purchase button, a sign-up button, etc. These events are captured and saved to be used as a metric.
How to Define our Metric
Note that how you define your metric depends on the problem you are trying to solve. In the case of this website, users are first randomly assigned to the control group (A) or the experimental group (B). Then, for each group, we capture the total number of visitors to the web page and the total number of users who clicked on the button.
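To make the assignment and counting step concrete, here is a minimal Python sketch. It uses hash-based bucketing, a common stand-in for random assignment that keeps a returning user in the same group; this is a swapped-in technique, not something the article prescribes. The function names, the experiment label "button_test", and the (user_id, clicked) event format are all hypothetical, and each user is assumed to appear once.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "button_test") -> str:
    """Place a user in group 'A' (control) or 'B' (experimental).

    Hashing the (hypothetical) experiment label together with the user id
    gives a stable, roughly 50/50 split without storing any state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def tally(events):
    """Count visitors and clicks per group.

    `events` is an iterable of (user_id, clicked) pairs, assumed to
    contain one entry per unique visitor.
    """
    counts = {
        "A": {"visitors": 0, "clicks": 0},
        "B": {"visitors": 0, "clicks": 0},
    }
    for user_id, clicked in events:
        group = assign_group(user_id)
        counts[group]["visitors"] += 1
        counts[group]["clicks"] += int(clicked)
    return counts
```

The per-group visitor and click totals returned by `tally` are the raw inputs for the calculations in the next section.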
How to Calculate our Result
Once the metrics are set, we can compare the two groups by estimating the probability that a user in each group performed the action on the web page. We define these as:
p̂con = Xcon ∕ Ncon (estimated probability for the control group-A)
p̂exp = Xexp ∕ Nexp (estimated probability for the experimental group-B)
Here “Xcon” and “Ncon” are, respectively, the total number of users who clicked on a button in the control group and the total number of users who visited the web page in the control group; the same definitions apply to the experimental group. The next thing to do is calculate the estimated difference in probabilities between the two groups. In a hypothesis test, this estimated difference is what leads us to either reject or fail to reject the null hypothesis. Rejecting it suggests that the true click probabilities of the two groups differ, that is, the observed difference is statistically significant. We define it as:
d̂ = (Xexp ∕ Nexp) − (Xcon ∕ Ncon) = p̂exp − p̂con (a positive d̂ means the experimental group did better)
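As a quick worked example, the sketch below computes the two estimated probabilities and their difference in Python. The click and visitor counts are hypothetical numbers chosen only for illustration, and the difference is taken as experimental minus control (an assumed sign convention, chosen so that a positive value means the change did better).

```python
def estimated_difference(x_con, n_con, x_exp, n_exp):
    """Return (p_con, p_exp, d_hat) from click and visitor counts."""
    p_con = x_con / n_con  # estimated click probability, control group (A)
    p_exp = x_exp / n_exp  # estimated click probability, experimental group (B)
    d_hat = p_exp - p_con  # assumed convention: experimental minus control
    return p_con, p_exp, d_hat


# Hypothetical counts: 974 clicks from 10072 control visitors,
# 1242 clicks from 9886 experimental visitors.
p_con, p_exp, d_hat = estimated_difference(974, 10072, 1242, 9886)
print(f"p_con={p_con:.4f}, p_exp={p_exp:.4f}, d_hat={d_hat:.4f}")
# prints: p_con=0.0967, p_exp=0.1256, d_hat=0.0289
```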
The general rules for rejecting or retaining the null hypothesis are:
- The null hypothesis (H0) states that the population means (here, the true click probabilities) of both groups are equal (μcon = μexp), so the true difference is zero. Under H0, d̂ is approximately normally distributed around zero. If the observed d̂ is close enough to zero to be explained by chance at our chosen confidence level, we fail to reject H0: the change shows no detectable effect and the result is not statistically significant.
- The alternative hypothesis (H1) states that the population means differ (μcon ≠ μexp). If the observed d̂ is further from zero than chance alone would plausibly produce, we reject H0 in favor of H1 and conclude that the change has an effect that is statistically significant.
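One common way to turn these rules into a calculation is a two-sided z-test for two proportions, sketched below in Python. It relies on the pooled standard error that this article defines in the confidence interval discussion, and it assumes samples large enough for the normal approximation to be reasonable. The function name, the hypothetical counts, and the 5% significance level are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist  # available in Python 3.8+


def two_proportion_z_test(x_con, n_con, x_exp, n_exp, alpha=0.05):
    """Two-sided z-test of H0: both groups share the same click probability.

    Returns (z, p_value, reject_h0). `alpha` is an assumed significance level.
    """
    d_hat = x_exp / n_exp - x_con / n_con
    p_pool = (x_con + x_exp) / (n_con + n_exp)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_con + 1 / n_exp))
    z = d_hat / se
    # Probability of seeing a difference at least this extreme if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical counts, for illustration only.
z, p_value, reject_h0 = two_proportion_z_test(974, 10072, 1242, 9886)
print(f"z={z:.2f}, reject H0: {reject_h0}")
```

With these illustrative counts the difference is many standard errors away from zero, so the null hypothesis is rejected; with identical click rates in both groups, z would be 0 and the null hypothesis would be retained.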
After our hypothesis testing, the obvious next thing to calculate is how large this difference is. This leads us to the confidence interval, which is used to estimate the size of the population mean difference. In order to calculate the confidence interval, we need the estimated pooled probability (total clicks across both groups divided by total visitors across both groups), the standard error, and the margin of error. We define these as:
p̂pool = (Xcon + Xexp) ∕ (Ncon + Nexp)
S.E = √[ p̂pool (1 − p̂pool) × (1 ∕ Ncon + 1 ∕ Nexp) ]
M.E = S.E × Z score (here the z-score is taken from the standard normal table at the specified confidence level, for example 1.96 for a 95% confidence level).
Now we can define the lower and upper bounds of our confidence interval. The lower bound is the estimated difference minus the margin of error, that is, d̂ – M.E, and the upper bound is the estimated difference plus the margin of error, that is, d̂ + M.E.
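The formulas above can be sketched in a few lines of Python. The counts are hypothetical, the 95% confidence level is an assumed default, and the z-score is computed with the standard library's NormalDist rather than read from a table. The difference is again taken as experimental minus control, which is an assumed sign convention.

```python
from math import sqrt
from statistics import NormalDist  # available in Python 3.8+


def confidence_interval(x_con, n_con, x_exp, n_exp, confidence=0.95):
    """Return (lower, upper) bounds for the difference in click probability."""
    d_hat = x_exp / n_exp - x_con / n_con               # estimated difference
    p_pool = (x_con + x_exp) / (n_con + n_exp)          # pooled probability
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_con + 1 / n_exp))  # standard error
    z_score = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # about 1.96 at 95%
    me = z_score * se                                   # margin of error
    return d_hat - me, d_hat + me


# Hypothetical counts, for illustration only.
lower, upper = confidence_interval(974, 10072, 1242, 9886)
print(f"lower={lower:.4f}, upper={upper:.4f}")
# prints: lower=0.0202, upper=0.0376
```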
Interpreting our Result
Once our results are in, the next task is to analyze and interpret them and give recommendations. Suppose our hypothesis test has shown that the result is statistically significant. Great! But not so fast: a statistically significant result is not necessarily practically significant, that is, important in a real-world sense. This is where the big player comes in: the confidence interval, which gives us a range of values for a quick comparison. We choose a practical significance value, the smallest change that would be worth acting on for the business, and use it as a decision boundary for evaluating our result. We then compare the lower bound of our confidence interval (d̂ – M.E) with this practical significance value, which tells us whether the change is at least as large as what matters to the business at our chosen confidence level. When interpreting an A/B test result, there are three cases:
- Cases where we have a positive outcome at the end of our test: the result is statistically significant and the change clears our practical significance value at our confidence level. In this scenario, it is a big win, and it is okay to launch the new product.
- Cases where our test shows a negative outcome: the experimental version did not beat the control at our confidence level, so the control remains the winning solution. This is still a win, since we avoid launching a change that doesn't help and simply retain the original solution.
- In the final case, we have a neutral result, where the confidence interval does not clearly fall on either side of the practical significance value. This is regarded as an inconclusive experiment, and the best recommendation in such a scenario is to collect more data for an additional test.
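The three cases can be expressed as a small decision rule. The Python sketch below is one possible reading of them, not a definitive procedure: it assumes the confidence interval is for experimental minus control, and the function name, the returned labels, and the practical significance value of 0.02 are hypothetical.

```python
def recommend(lower, upper, d_min):
    """Map a confidence interval and a practical significance value to a decision.

    `lower` and `upper` are the confidence interval bounds for the difference
    (experimental minus control); `d_min` is the smallest change worth acting on.
    """
    if lower >= d_min:
        # Even the most pessimistic estimate clears the bar: positive outcome.
        return "launch"
    if upper < d_min:
        # Even the most optimistic estimate falls short: negative outcome.
        return "keep control"
    # The interval straddles the bar: neutral outcome.
    return "inconclusive: collect more data"


# Hypothetical interval and threshold, for illustration only.
print(recommend(0.0202, 0.0376, d_min=0.02))
# prints: launch
```

Under this reading, an interval such as (0.01, 0.03) with the same threshold would come back as inconclusive, because the true effect could plausibly sit on either side of 0.02.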
Conclusion
With so much business now conducted online, A/B testing has become a standard tool for making product decisions. However, even though A/B tests help with many business challenges, we should not always expect a good result, as they present their own challenges, including the time spent running a test, which can be a few weeks or more, depending on your problem.
We have gone through the main steps for implementing an A/B test. In summary, we started with metric selection, which is the most important aspect of setting up an A/B test. We then ran a hypothesis test on our metric to check for statistical significance, which led us to the confidence interval to estimate how large the effect is. Finally, we calculated the confidence interval's lower and upper bounds and used them to judge whether the effect is meaningful in a real-world sense.
Thanks for reading!
Reference
https://classroom.udacity.com/courses/ud257/lessons/4018018619/concepts/40043987120923