Predictive analytics and machine learning are the hot topics in advanced analytics at the moment, but sometimes predicting how the KPIs on your dashboard will look next week isn't enough. You might want to know how they would look if you ran a new marketing campaign, or what is driving the new trend. The first approach to estimating the causal effect of X on Y should be an A/B test, but sometimes it is impossible or too expensive. Observational data poses challenges such as measurement error bias, selection bias and omitted variable bias. I will introduce a few techniques that can help you analyze causal effects even when your boss thinks that perhaps you shouldn't run that A/B test with your website functionality. Difference-in-differences, regression discontinuity and instrumental variables can be helpful, but careful thinking is more important than fancy methods.

# When you might need causal inference

Perhaps my favorite thing about being a consultant is the exposure to so many different projects and industries. At the moment, most of the buzz is on predictive analytics and machine learning. And for a good reason: cheap computation, increased know-how and richer data allow companies to allocate their resources better, whether it means sending their maintenance guys to the right place at the right time or optimal usage of the marketing budget. However, oftentimes we are interested not only in what will likely happen but also in why. This leads us to the only lesson most people remember from their stats 101: correlation does not imply causation.

The difference between X predicting Y and X explaining Y is usually very clear, but sometimes people get lost in overly complicated models and lose track of the bigger picture. The number of police patrolling a neighborhood will probably predict the crime rate very well, but it would be a long shot to claim that police cause crime (at least in Finland). If your company is in the business of selling home security systems, you might not really care (although your conscience probably should) about spending time digging deeper into the root causes, but if you’re trying to solve the problem you need to dive in.

# What causal effects are and how they differ from prediction

The causal effect of X on Y is basically the difference between Y in a world where X did happen and Y in a world where X did not happen. The obvious problem (and the fundamental problem of causal inference) is that we’ll never actually see the difference and hence we need to estimate it. The gold standard for estimating causal effects is a randomized experiment (or A/B test) and whenever you need to estimate causal effects, it should be the first thing on the table.

Sadly, many times this is either impossible or just infeasible. Consider a situation where your sales team lead asks for money to hire two new employees stating that 15% of their contacts lead to conversion so this would be a great investment. You might consider the number impressive but does it mean that the sales team is really making conversions happen or finding the people who would have made the purchase in any case? If the contacts were chosen at random there would be a good case to hire more people. However, if the high conversion rate is achieved by cream skimming the most likely conversions there might not be a sufficient case to extend the team. The truth is probably somewhere in the middle and figuring out where “exactly” is usually not a trivial task.

What makes causal questions so different from pure prediction is that they can’t be computed from data alone. Some statisticians would go as far as claiming that there can be “no causation without manipulation”, which is probably right in a strict sense, but in practice there are cases when we get “close enough” to draw useful insights. In a controlled experiment we take a sample, divide it into groups at random and expose one of the groups to a treatment. So basically we are trying to get as close as possible to a situation where we can observe the “subjects” in two parallel worlds and measure the difference.
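To make the "parallel worlds" idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the effect size of 2.0, the outcome distribution, the sample size). Each simulated subject has two potential outcomes, but we only observe the one chosen by a coin flip.

```python
import random
from statistics import mean

random.seed(0)

TRUE_EFFECT = 2.0  # assumed effect of the treatment, chosen for illustration

observed_treated = []
observed_control = []
for _ in range(20_000):
    # Each subject has two potential outcomes, one per "parallel world".
    y_without = random.gauss(10.0, 3.0)
    y_with = y_without + TRUE_EFFECT

    # A coin flip decides which of the two worlds we actually observe.
    if random.random() < 0.5:
        observed_treated.append(y_with)
    else:
        observed_control.append(y_without)

# Random assignment makes the groups comparable, so the difference in
# group means estimates the average causal effect.
estimated_effect = mean(observed_treated) - mean(observed_control)
print(f"true effect: {TRUE_EFFECT:.2f}, estimated: {estimated_effect:.2f}")
```

We never see both outcomes for the same subject, yet the difference in group means should land close to the assumed effect because the coin flip is unrelated to anything else about the subject.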

# Problems with observational data

When we are trying to estimate how X affects Y from observational data, one naturally considers regressing Y on X and interpreting the coefficient as the causal effect. In many cases this will give you correct estimates (if the model really is linear), but there are common issues that can bias your results. There could be variables that affect both X and Y, like the average income and age distribution of the neighborhood in the police and crime rate example. We could be measuring X with error, or we could be trying to estimate the demand function from data that is generated by the interplay of supply and demand. These problems are called omitted variable bias, measurement error bias and simultaneity bias, respectively.
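As a rough illustration of one of these issues, measurement error, the sketch below regresses Y on a correctly measured X and then on a noisy version of it. The true slope of 2.0, the noise levels and the helper `ols_slope` are all assumptions introduced for this example only.

```python
import random
from statistics import mean

random.seed(1)

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

TRUE_SLOPE = 2.0  # assumed causal effect of X on Y
n = 20_000

x_true = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [TRUE_SLOPE * x + random.gauss(0.0, 1.0) for x in x_true]

# We only record X with noise whose variance equals the variance of X.
# In theory this shrinks the slope by var(X) / (var(X) + var(noise)) = 0.5.
x_measured = [x + random.gauss(0.0, 1.0) for x in x_true]

slope_clean = ols_slope(x_true, y)
slope_noisy = ols_slope(x_measured, y)
print(f"slope with exact X: {slope_clean:.2f}")
print(f"slope with noisy X: {slope_noisy:.2f}")
```

Under these assumptions the noisy regression understates the effect, even though nothing is wrong with the regression itself.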

In order to carry the intuition over different problems and techniques let’s consider a simple example of a national chain of ice cream trucks where the management wants to understand how the price (X) affects sales (Y). An example of an experimental approach would be to charge three or four euros at random from each customer and see how they take it. If we assumed that this would have no effect on who turns up in the first place and no behavioral response, we would uncover the causal effect of changing the price from three to four euros. Obviously, there would be people who wouldn’t make the purchase because you’re treating them unfairly and the experimental design would not work.

If there have been changes in pricing, one might be tempted to regress the logarithm of quantity sold against the logarithm of price to estimate the price elasticity of demand. When considering this approach, the analyst needs to be careful. Are there omitted variables? Are our measurements precise? Do we observe the demand directly? For example, are we giving discounts during winter or do we advertise more when the demand is low?

Where I am trying to head is that if the variation in prices is not random, we can’t directly estimate the effect of prices, since we are seeing a combination of multiple effects in the data. Attributing all of them to the price effect will give us a biased estimate of the causal effect of price on sales. For example, if we increased ice cream prices in June and naively estimated the change in sales, we might conclude that a higher price increases demand. This is a safe case, since we'd immediately realize that something went wrong. If the total effect had been really small, we could conclude that the demand is inelastic and drastically increase prices, only to see our sales plummet. In some cases, we might be able to control for the confounding variables, such as weather in the case above. In others, we aren’t so lucky, but sometimes we can use special methods to get the correct estimates anyway.
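The June example can be sketched with simulated data. The numbers below (a true elasticity of -1.5, prices raised by 0.2 log points in summer, a summer demand lift of 1.0 log points) are invented for illustration, and `ols_slope` and `demean_by_group` are hypothetical helpers. The naive log-log regression picks up the season together with the price, while de-meaning within season (equivalent to adding a summer dummy) should recover something close to the assumed elasticity.

```python
import math
import random
from statistics import mean

random.seed(2)

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def demean_by_group(values, groups):
    """Subtract each group's mean (same as controlling for a group dummy)."""
    group_means = {
        g: mean([v for v, vg in zip(values, groups) if vg == g])
        for g in set(groups)
    }
    return [v - group_means[g] for v, g in zip(values, groups)]

TRUE_ELASTICITY = -1.5  # assumed price elasticity of demand
n = 20_000

summer = [random.random() < 0.5 for _ in range(n)]
# Hypothetical pricing policy: prices are higher in summer.
log_price = [math.log(3.0) + 0.2 * s + random.gauss(0.0, 0.05) for s in summer]
# Warm weather raises demand regardless of the price.
log_qty = [
    5.0 + TRUE_ELASTICITY * lp + 1.0 * s + random.gauss(0.0, 0.2)
    for lp, s in zip(log_price, summer)
]

naive = ols_slope(log_price, log_qty)
controlled = ols_slope(
    demean_by_group(log_price, summer),
    demean_by_group(log_qty, summer),
)
print(f"naive elasticity:      {naive:.2f}")
print(f"controlling for season: {controlled:.2f}")
```

In this setup the naive estimate comes out positive, which is the "higher price increases demand" conclusion from the text, while the controlled estimate is negative.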

# Difference-in-differences (D-i-D)

D-i-D is a technique commonly used in applied econometrics that tries to apply experimental research design principles to observational data. The basic idea is to find a group that behaves just like the experimental group and use it as the control group. Going back to the ice cream example, instead of giving random prices to customers of the same truck, different pricing could be applied to trucks in different geographical regions. If the customers in different markets are similar and can’t buy from the other markets, we are very close to an experimental design. However, if the markets are too far apart we are more susceptible to a bad control group (for example, the weather could be different), and if the markets are too close people just buy their ice cream from the next block.

What we need for D-i-D to be a useful method for a given problem is that without a change in the variable of interest (for example price), the two groups would have parallel trends and that the only difference is the treatment. If ice cream sales increase rapidly in spring and we change the price, we can’t really tell the two effects apart. But if sales increase at the same pace in Helsinki and Turku and we only change the price in Helsinki, we can use vans in Turku as a control group. For a more detailed treatment read this.
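The Helsinki and Turku comparison boils down to simple arithmetic on four averages. The sketch below simulates them under assumed numbers (a shared spring lift of 20, a price effect of -10 in Helsinki, different base levels per city). The function names and all figures are hypothetical.

```python
import random
from statistics import mean

random.seed(3)

TRUE_PRICE_EFFECT = -10.0  # assumed effect of the price increase on weekly sales
SPRING_TREND = 20.0        # assumed seasonal lift shared by both cities

def weekly_sales(city, after):
    """Simulated weekly sales of one van, before or after the price change."""
    base = 130.0 if city == "Helsinki" else 100.0
    trend = SPRING_TREND if after else 0.0
    # Only Helsinki changes its price, and only in the "after" period.
    effect = TRUE_PRICE_EFFECT if (city == "Helsinki" and after) else 0.0
    return base + trend + effect + random.gauss(0.0, 5.0)

def average_sales(city, after, n=2_000):
    return mean([weekly_sales(city, after) for _ in range(n)])

helsinki_change = average_sales("Helsinki", True) - average_sales("Helsinki", False)
turku_change = average_sales("Turku", True) - average_sales("Turku", False)

# The change in Turku stands in for what would have happened in Helsinki
# without the price change (the parallel trends assumption).
did_estimate = helsinki_change - turku_change
print(f"before/after change in Helsinki: {helsinki_change:.1f}")
print(f"before/after change in Turku:    {turku_change:.1f}")
print(f"difference-in-differences:       {did_estimate:.1f}")
```

Looking at Helsinki alone would suggest that sales went up after the price increase. Subtracting the Turku change removes the shared spring trend, and the different base levels of the two cities cancel out as well.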

The applicability of D-i-D is not restricted to pricing. We might be interested in measuring the success of a regional marketing campaign or a change in the layout of our department stores. What we need is a question to answer and two or more regions, segments or similar groups that are exposed to the same outside effects.

# Regression discontinuity (RD)

What’s the biggest difference between a child born on 31/12/2009 and one born on 1/1/2010, or between someone who just receives a scholarship and someone who just doesn’t, based on a test score? RD is a technique that uses "almost random assignment" to treatment and control groups. Suppose that the ice cream company has collected information about its customers and wants to send discount coupons to those who haven’t made a purchase in the last two months. It chooses the 1000 customers who used to buy the most and sends them an offer. The probability that a customer who used to buy often returns is probably a lot higher than the probability that an average customer returns. Hence comparing the two groups doesn’t give us a good estimate of the success of the campaign, since the participants were not chosen randomly.

However, if we compare the 1000th and the 1001st customer on the list, there’s either no or practically no difference in their earlier behavior. When we increase the “window size”, the difference between the two groups grows as well, and what is “good enough” can be a tough call that should be examined carefully. For example, 100 customers on both sides of the cutoff could be practically the same, and RD would be as good as an experimental design. However, this design might not work if we used more observations, since the groups would be too different. Since the RD design is focused around the cutoff, this is also a source of problems. If the data is sparse around the cutoff we might not be able to use RD, and estimates obtained from around the cutoff might not generalize to the whole population.
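The window size trade-off can be seen in a small simulation. This is only a sketch under invented assumptions: coupons go to customers whose past spend is at or above a hypothetical cutoff of 50, the coupon adds 5.0 to later spend, and later spend also rises with past spend.

```python
import random
from statistics import mean

random.seed(4)

CUTOFF = 50.0      # hypothetical past-spend threshold for receiving the coupon
TRUE_EFFECT = 5.0  # assumed effect of the coupon on later spend

customers = []
for _ in range(20_000):
    past_spend = random.uniform(0.0, 100.0)
    got_coupon = past_spend >= CUTOFF
    # Heavy buyers keep buying more even without a coupon.
    later_spend = (
        0.5 * past_spend
        + (TRUE_EFFECT if got_coupon else 0.0)
        + random.gauss(0.0, 2.0)
    )
    customers.append((past_spend, got_coupon, later_spend))

def difference_in_window(window):
    """Mean later spend of coupon receivers minus non-receivers,
    using only customers within +/- window of the cutoff."""
    treated = [y for s, t, y in customers if t and s - CUTOFF <= window]
    control = [y for s, t, y in customers if not t and CUTOFF - s <= window]
    return mean(treated) - mean(control)

estimates = {w: difference_in_window(w) for w in (2.0, 10.0, 50.0)}
for window, estimate in estimates.items():
    print(f"window +/-{window:>4}: estimated effect {estimate:.1f}")
```

With the widest window we are back to comparing all receivers with all non-receivers, and the estimate is dominated by the difference in past behavior. Narrow windows get closer to the assumed effect but use far fewer customers. A common refinement is to fit a separate line on each side of the cutoff and compare the two predictions at the cutoff.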

An RD design can be used in cases where the selection into “treatment” is done by some underlying variable (purchases last year, expected customer lifetime value, credit score, age etc.) and the groups on both sides of the cutoff are practically the same. RD has a lot of potential, for example in evaluating marketing campaigns. Neat examples can be found here.

# Instrumental variable (IV)

The first application of an IV was to estimate supply and demand equations. We do not observe either one directly, since the resulting scatter plot of price and quantity combinations is a result of an interplay of two effects. We could estimate the shape of the demand curve if we could see supply moving alone. An IV makes precisely this possible: it is a variable that affects supply but does not affect demand.

Even though the use of instrumental variables grew from the effort to estimate simultaneous equations, their applicability is more general. IVs are commonly used for estimation when we have problems with omitted variable bias. A classic textbook example is the estimation of the (monetary) return to education or, in plain English, how much education affects wages. We can see from the data that educated people earn more on average, but is this because “talented” people study more or because education increases earnings? We don't really see this "talent" and we expect that it increases both earnings and schooling. An IV would be something that increases schooling but does not affect earnings directly, such as the distance from home to the closest university.

More generally, suppose we want to estimate how X affects Y but we know that there is a confounding variable C that affects both X and Y. An IV is a variable Z that affects X but does not affect Y directly and is unrelated to C. Basically we are using the variation in Z to isolate changes in X that do not occur because of C and then use these changes to estimate how X affects Y.
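A minimal simulated sketch of this logic, writing Z for the instrument: Z moves X, the unobserved C moves both X and Y, and the true effect of X on Y is assumed to be 2.0. All coefficients and the `covariance` helper are made up for the example. With a single instrument, the IV estimate is the ratio of two covariances, which is the same number two-stage least squares would give.

```python
import random
from statistics import mean

random.seed(5)

def covariance(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

TRUE_EFFECT = 2.0  # assumed causal effect of X on Y
n = 20_000

z = [random.gauss(0.0, 1.0) for _ in range(n)]  # instrument
c = [random.gauss(0.0, 1.0) for _ in range(n)]  # unobserved confounder
# X depends on the instrument and on the confounder.
x = [zi + ci + random.gauss(0.0, 1.0) for zi, ci in zip(z, c)]
# Y depends on X and on the confounder, but not directly on Z.
y = [
    TRUE_EFFECT * xi + 3.0 * ci + random.gauss(0.0, 1.0)
    for xi, ci in zip(x, c)
]

# The plain regression of Y on X mixes the effect of X with that of C.
ols_estimate = covariance(x, y) / covariance(x, x)
# The IV estimate only uses the part of X that moves with Z.
iv_estimate = covariance(z, y) / covariance(z, x)
print(f"regression of Y on X: {ols_estimate:.2f}")
print(f"IV estimate:          {iv_estimate:.2f}")
```

The regression overstates the effect because C pushes X and Y in the same direction, while the IV estimate should land close to the assumed 2.0. This only works because Z is generated independently of C and enters Y only through X, and with real data that is the part you have to argue for rather than compute.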

The concept of the IV is elusive at first, but IVs are super powerful in causal analysis. Suppose that our favorite ice cream truck chain has decided to digitalize and open a website where customers can pre-order their ice cream, in order to increase sales. We want to model the effect of pre-ordering on order size. If we want to use an IV approach, we need a variable that affects pre-ordering but does not directly affect the order size. If we have advertised the new service to some of our customers, the campaign would be a good candidate. (Or we can create one.)

A much better treatment of instrumental variables can be found here.

# Closing words

In my opinion, causal analysis is a lot more about careful thinking, a thorough understanding of the problem and good design than about fancy methods. Understanding the most common pitfalls can save you from false conclusions, and the above techniques can help you estimate causal effects when you can't perform A/B tests.

Finding good RD designs or IVs can be challenging. Just like with everything else, you see IVs everywhere until you actually need one. Sometimes you can make unexpected discoveries when you turn things upside down and start thinking about what you could do with the IV you found.

If you're interested in learning more on your own, consider reading this book: http://www.mostlyharmlesseconometrics.com

Do not hesitate to reach out with comments and feedback! (ville.suvanto@bigdatapump.com)