Correlation vs causation: What's the difference?

Data is an extremely powerful element of any decision. American statistician W. Edwards Deming once said, "In God we trust. All others must bring data." But too often, data can be misconstrued. One of the biggest confusions during any data analysis revolves around correlation vs. causation.

There are countless articles that share wild, often tongue-in-cheek conclusions as a result of two strongly correlated data sets. For example, Harvard Business Review once looked at examples showing the "possibility" that:

  • Spending more to see sports matches reduces your likelihood to consume high-fructose corn syrup

  • More iPhones sold means more people die from falling down the stairs

These are extreme examples, but they make the point that correlation does not necessarily imply causation, and they show the dangers of confusing the two in the real world. In these cases, extra vetting is needed before a correlation can be treated as evidence of causation.

What is the difference between correlation and causation?

Let's start with the basics. What is the definition of causation versus correlation?

What is correlation?

The Australian Bureau of Statistics provides a great definition of correlation:

"[It is] a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables."

In other words, when two variables are correlated, a change in one is typically accompanied by a change in the other, in either the same or the opposite direction.

What are the different types of correlations?

  • Positive correlation: Variables A and B move in the same direction. For example, as Variable A increases, so does B.

  • Negative correlation: Variables A and B move in opposite directions. For example, as Variable A increases, B decreases.

  • No correlation: There is no apparent link between Variables A and B.

Three graphics showing examples of a positive correlation, a negative correlation, and no correlation.

The strength of the linear relationship between two variables is measured by the correlation coefficient, which can range from -1 (perfect negative correlation) to 1 (perfect positive correlation). The closer the coefficient is to either -1 or 1, the stronger the relationship. A coefficient of 0, on the other hand, indicates that there is no linear correlation between the two variables.
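To make the coefficient concrete, here is a minimal Python sketch. It assumes the coefficient in question is Pearson's r, the usual measure of linear correlation, and computes it for three small made-up data sets. The function name pearson_r and all of the numbers are illustrative assumptions, not data from any source cited in this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how much each varies on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

a = [1, 2, 3, 4, 5]                           # hypothetical Variable A

r_positive = pearson_r(a, [2, 4, 6, 8, 10])   # B rises as A rises
r_negative = pearson_r(a, [10, 8, 6, 4, 2])   # B falls as A rises
r_none = pearson_r(a, [3, 1, 4, 1, 3])        # B has no linear link to A

print(r_positive, r_negative, r_none)
```

The first two data sets lie on perfect straight lines, so their coefficients sit at the ends of the range. The third is built symmetrically so that its coefficient comes out at essentially 0.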

However, a correlation does not necessarily mean that one variable causes the other. This brings us to causation…

What is causation?

The Australian Bureau of Statistics goes on to define causation, also known as 'causality,' the following way:

"...one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events. This is also referred to as cause and effect."

In other words, does one variable actually impact the other?

Causation vs. Correlation Examples

Spurious Correlations is an entertaining resource that shares examples of variables that are strongly related but do not cause one another. At least, they should not.

Case in point: is eating margarine behind Maine's divorce rate?

Chart showing a strong correlation between the divorce rate in Maine and margarine consumption

Source: tylervigen.com (link to license)

Sticking to food examples, could cheese be the secret fuel that powers civil engineers in their studies?

Chart showing a strong correlation between mozzarella cheese consumption and the number of civil engineering doctorates awarded

Source: tylervigen.com (link to license)

Both charts show strong correlations between the two variables plotted. However, these are likely classic cases of "correlation does not imply causation." That is, unless margarine is indeed a touchy subject for couples in Maine or there are ground-breaking new effects of eating large amounts of cheese.

Why is knowing the difference between correlation vs. causation important?

The correlation and causation examples above show why getting the difference right is critical.

Avinash Kaushik, Digital Marketing Evangelist at Google, wrote in 2016 about how not understanding the difference can be very problematic. Kaushik highlighted an article from The Economist that asserted that eating more ice cream can boost student scores on the PISA reading scale.

"To normal people (non-Analysts), this graph and article looks legit," wrote Kaushik. "after all this is a reputable site and it is a reputable team. Oh, and look there is a red line, what looks like a believable distribution, and a R-squared!"

But Kaushik wants us to think a bit harder about the data at hand, and not take things at face value.

He points out that, despite a reasonable correlation, nothing in the data establishes that one variable causes the other. There may appear to be a link connecting IQ to ice cream consumption. However, the data doesn't definitively reveal anything beyond that correlation.

Making bold claims

In our everyday lives, we have access to more data than ever before. Decisions, opinions, and even business strategies can depend on our ability to tell the difference between correlation and causation.

Kaushik uses the example above to remind people to be more skeptical of claims that draw bold conclusions from correlated data points. He encourages readers to look deeper at the data and avoid easy decisions.

"Our job is to be skeptical, to dig and understand and poke and prod and to reject the outrageously wrong and if it is not outrageously wrong then to figure out how right it might be so that you can make an educated recommendation." - Avinash Kaushik

Causality vs. correlation is also a topic that Michael Molnar examines in a Forbes article. Molnar warns that:

"Confusing correlation with causation is not an unknown issue but it is becoming increasingly problematic as data increases and computers get more powerful… it gets to the heart of what we know – or think we know – about how the world works."

It can be difficult to infer causation between two variables. Randomized controlled experiments and other statistical tests are often needed to validate whether one variable does, in fact, impact another. Moreover, while correlations can be useful measures, they have limitations. As noted above, the correlation coefficient usually measures only a linear relationship, so it can miss relationships that follow a different pattern.
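The linear limitation is easy to see with a small sketch. In the hypothetical data below, y is completely determined by x (y is x squared), yet the Pearson coefficient comes out at zero because the relationship is a curve rather than a straight line. The helper name pearson_r and the values are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-5, 6))        # -5, -4, ..., 5
ys = [x * x for x in xs]       # y depends entirely on x, but not linearly

r_curve = pearson_r(xs, ys)
print(r_curve)                 # essentially zero despite the perfect dependence
```

A coefficient near zero therefore rules out a linear relationship only. Plotting the data is often the quickest way to spot a pattern like this one.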

Getting correlation vs. causation right

In today's data-driven world, being more skeptical of specific findings before making bold claims, as Kaushik suggests, is essential. How can we do this? Further research and, whenever possible, additional testing.

Outside factors (called "confounders" or "lurking variables") can sometimes influence one or both of the variables in a given correlation. For example, some studies found a link between coffee consumption and risk of lung cancer. However, smoking has been found to be a potential confounding variable in the results, as one meta-analysis of these findings shows [1]. As with other key findings, further research can help clarify the context behind correlations.
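A small simulation can show how a confounder manufactures a correlation. The sketch below uses entirely synthetic numbers (the 30% smoking rate, the effect sizes, and the "risk score" are assumptions made up for illustration, not figures from the meta-analysis). In this toy model, smoking raises both coffee intake and risk, while coffee has no effect on risk at all.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)        # fixed seed so the run is repeatable

coffee, risk, smokes = [], [], []
for _ in range(5000):
    smoker = rng.random() < 0.3            # hypothetical 30% smoking rate
    # Smoking pushes up BOTH variables; coffee never feeds into risk.
    coffee.append(1.0 + 2.0 * smoker + rng.gauss(0, 1))
    risk.append(1.0 + 3.0 * smoker + rng.gauss(0, 1))
    smokes.append(smoker)

def r_within(group):
    """Correlation of coffee and risk among smokers (True) or non-smokers (False)."""
    xs = [c for c, s in zip(coffee, smokes) if s == group]
    ys = [r for r, s in zip(risk, smokes) if s == group]
    return pearson_r(xs, ys)

r_overall = pearson_r(coffee, risk)
r_smokers = r_within(True)
r_non_smokers = r_within(False)

print(f"overall: {r_overall:.2f}")
print(f"smokers only: {r_smokers:.2f}, non-smokers only: {r_non_smokers:.2f}")
```

Across everyone, coffee and risk look clearly correlated. Once the data is split by smoking status, the correlation inside each group should shrink to roughly zero, which is the signature of a confounder at work.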

Testing for causality is challenging. However, experimental design can help. In a designed experiment, a researcher tests a hypothesis by controlling one variable (the independent variable) and measuring its impact on another variable (the dependent variable). Most importantly, it can help them control for possible confounders to avoid potential bias in their results. For more information about how experimental design works, this overview by Britannica provides an excellent introduction.
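As a rough illustration of why controlling the independent variable matters, the sketch below simulates one hypothetical scenario in two ways. In the observational version, a hidden confounder influences who ends up "treated"; in the randomized version, a coin flip decides. The true effect, the confounder's strength, and every name in the code are assumptions chosen for the example.

```python
import random

rng = random.Random(7)         # fixed seed so the run is repeatable
TRUE_EFFECT = 1.0              # hypothetical real impact of the treatment

def mean(values):
    return sum(values) / len(values)

def difference_in_means(randomize, n=10000):
    """Estimate the treatment effect as mean(treated) - mean(control)."""
    treated_outcomes, control_outcomes = [], []
    for _ in range(n):
        confounder = rng.gauss(0, 1)       # hidden factor that affects the outcome
        if randomize:
            # Experiment: a coin flip assigns treatment, independent of the confounder.
            treated = rng.random() < 0.5
        else:
            # Observation: people with a high confounder are more likely to be treated.
            treated = confounder + rng.gauss(0, 1) > 0
        outcome = TRUE_EFFECT * treated + 2.0 * confounder + rng.gauss(0, 1)
        (treated_outcomes if treated else control_outcomes).append(outcome)
    return mean(treated_outcomes) - mean(control_outcomes)

observational_estimate = difference_in_means(randomize=False)
randomized_estimate = difference_in_means(randomize=True)

print(f"observational estimate: {observational_estimate:.2f}")
print(f"randomized estimate:    {randomized_estimate:.2f}")
```

In this toy setup, the observational comparison overstates the effect because the treated group also has a higher confounder on average. Random assignment balances the confounder across both groups, so the estimate should land close to the true effect of 1.0.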

Approach your customer experience analytics with confidence

At Emplifi, we help top brands measure and elevate the customer experience with proven customer engagement solutions, including Emplifi VoC, our voice of the customer solution. Our VoC experts are also there to help make sense of your data so you get actionable insights you can feel confident about.

See how we can help. Request your personalized demo today.

Additional resources

Below are some great resources that explain correlation vs cause and effect.

[1] Galarraga, V., & Boffetta, P. (2016). Coffee Drinking and Risk of Lung Cancer—A Meta-Analysis. Cancer Epidemiology Biomarkers & Prevention, 25(6), 951–957. https://doi.org/10.1158/1055-9965.epi-15-0727

Editor's Note: This article was originally published on astutesolutions.com. Any statistics or statements included in this article were current at the time of original publication.