The Danger of Extrapolation in Regression Analysis

Regression analysis is a valuable statistical tool primarily because it allows analysts to predict, or regress as we prefer to call it, one variable from a set of other variables. It is one of the techniques used in predictive analytics. Predictive analytics is a powerful capability to have in most scenarios, as it lets users envision what an outcome might be based on several inputs fed into a mathematical model.

As predictive analytics has surged in popularity with the advent of things such as big data, I believe that the topic of extrapolation merits some attention. Oxford Dictionaries defines extrapolation as “extend the application of (a method or conclusion) to an unknown situation by assuming that existing trends will continue or similar methods will be applicable”, with assuming being the key word here. In mathematical terms, Wikipedia refers to extrapolation as “the process of estimating, beyond the original observation interval, the value of a variable on the basis of its relationship with another variable”, with beyond the original observation interval being the key phrase. Therein lies a problem.

In my experience, I have encountered extrapolation being used to predict values that the mathematical models do not support. Frequently, my audiences are confused by my initial reluctance to extrapolate. This is an anecdote, by the way; I do not have any statistical proof of it. However, I am certain that many of you in the statistical field would identify with this scenario.

An Example Using a Natural Data Set

Let us consider a simple example, as shown in the chart below, showcasing the correlation between height and weight of a certain sample taken to represent a certain population. For the purpose of this example, let us assume that all the proper statistical techniques of population sampling and hypothesis tests have been correctly undertaken. In a real example, we of course do not want to make all these assumptions; we would actually conduct all the necessary statistical procedures. However, showing that would divert us from our topic of extrapolation.

Height and Weight Correlation

As we can see from this chart, the mathematical model, denoted by the function f(x) = 0.90x + 107.89, can reasonably be used to predict height (in cm) from weight (in kg). Statistically, this model is only supported for weight values between 45 kg (the minimum) and 94 kg (the maximum) in the data set. Thus, when predicting height, we can only use the formula within that weight range. Any height predicted for a weight outside it, namely less than 45 kg or more than 94 kg, is not guaranteed to hold true.

This can be seen in the equation. In the real world, a weight of 0 kg means that there is no subject being measured, and therefore the height should be 0 cm. However, if we apply the equation above with x equal to 0 kg, we get a height of f(0) = 107.89 cm. This is of course illogical: no subject can have zero weight yet stand just over a metre tall. Similarly, at the other end of the spectrum, if we were to predict height using a weight of, say, 120 kg, we would get f(120) = 0.90 × 120 + 107.89 = 215.89 cm. This would hardly be true in the real world either.
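The arithmetic above can be reproduced with a minimal Python sketch. The coefficients and the 45 kg to 94 kg range come from the chart; the function and constant names are my own illustrative choices and do not belong to any library.

```python
# Coefficients and observed range as quoted from the height-weight chart.
SLOPE = 0.90          # cm per kg
INTERCEPT = 107.89    # cm
WEIGHT_MIN_KG = 45.0  # smallest weight in the sample
WEIGHT_MAX_KG = 94.0  # largest weight in the sample


def predict_height_cm(weight_kg: float) -> float:
    """Raw model output, with no check on the input range."""
    return SLOPE * weight_kg + INTERCEPT


def is_extrapolation(weight_kg: float) -> bool:
    """True when the weight lies outside the range the model was fitted on."""
    return not (WEIGHT_MIN_KG <= weight_kg <= WEIGHT_MAX_KG)


for weight in (0.0, 70.0, 120.0):
    label = "EXTRAPOLATION" if is_extrapolation(weight) else "within observed range"
    print(f"{weight:6.1f} kg -> {predict_height_cm(weight):6.2f} cm ({label})")
```

The model happily returns a number for any input; it is the range check, not the equation, that tells us whether the number deserves any trust.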

This is not an error in the mathematical model. It is simply because our sample data was collected only for weights between the two bounds mentioned above, so the model only holds true within that range. To predict values beyond it, we would need to gather samples that span other weight ranges, perhaps by increasing the sample size. Once we have that data, the correlation may look entirely different; it may not even be linear over a wider range of weights.
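To illustrate how a relationship can look linear inside the sampled range and still bend outside it, here is a small Python sketch. The curve in true_relationship is entirely made up for the demonstration (it is not real height and weight data), and the least-squares fit is written by hand so that the example needs no external packages.

```python
def true_relationship(x: float) -> float:
    # Hypothetical saturating curve, chosen only for illustration.
    return 200.0 * x / (x + 20.0)


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Sample only the range 45..94, mirroring the weight range in the example.
xs = [float(x) for x in range(45, 95)]
ys = [true_relationship(x) for x in xs]
slope, intercept = fit_line(xs, ys)


def linear_error(x: float) -> float:
    """Absolute gap between the fitted line and the (made-up) true curve."""
    return abs(slope * x + intercept - true_relationship(x))


for x in (0.0, 70.0, 200.0):
    print(f"x = {x:5.1f}: line is off by {linear_error(x):.2f}")
```

Inside the sampled range the straight line tracks the curve closely, while at x = 0 and x = 200 the gap is far larger, even though nothing about the fit itself is wrong.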

The example above shows the danger of extrapolation using a natural data set, that of height and weight. It may not seem very dangerous. However, in other settings, such as business, it can have crippling effects.

A Practical Example

Consider the chart below, which shows the correlation between the volume sold of a particular product, whose price differs slightly between regions, and the sales revenue it brings in. Again, let us assume that all the proper statistical procedures have been done.

Sales and Volume Correlation

The chart seems to show a near-perfect correlation. These are the kinds of charts that can fill audiences with excitement. They may conclude that because this chart clearly shows how well volume correlates with sales, increasing volume beyond the data set would definitely increase sales as well. They would use the mathematical model (the equation) to predict their sales pipeline and present these figures as part of their sales strategy. Therein lies the danger of using extrapolation in business.

Just like in the previous example, the predicted sales value cannot be guaranteed to hold true for volume values (that’s quite a tongue twister) below 25,385 units or above 77,615 units. If we think logically without the aid of this chart, we can theorize that a higher volume might be disastrous. We have no way to guarantee that customers would increase their spending on this product. A higher volume may create an increase in supply, which may lower demand and actually reduce sales.

However, I am amazed at how often such mistakes involving extrapolation are made, even though these logical observations require no statistical charts at all.

What Can Be Done?

As data scientists, or statisticians, or business analysts (or whatever we call ourselves), it is our role to convince our audiences that extrapolation is dangerous. Granted, in my experience, there can be two outcomes from this. One, I would be given some time to gather more data in order to come up with a better mathematical model that suits the audiences’ requirements. Two, I would not be given such time at all due to urgency, and would therefore be asked to provide an estimate. In either case, my audiences always want what they want.

If you get the first outcome, you are lucky. Hopefully, the time allowed will be enough for your further analysis. If you get the second outcome, this is where things get tricky. You can try to convince your audiences, using facts and real-world observations, that such estimates cannot be guaranteed to hold true, which would hopefully lead you back to the first outcome. This will require negotiation finesse. Alternatively, you may try to provide some form of estimate without resorting to blind guesses. Some researchers do try to analyse reliable extrapolation techniques that come with lower confidence, like in this article.
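One reason extrapolated estimates carry less confidence can be sketched with the textbook standard error of a new prediction in simple linear regression, which is the residual standard deviation multiplied by sqrt(1 + 1/n + (x0 - mean)^2 / Sxx). The Python below computes only that multiplier. The evenly spaced values from 45 to 94 are a hypothetical stand-in for the observed inputs, not the article's actual data, and the function name is illustrative.

```python
import math


def prediction_se_factor(xs, x0: float) -> float:
    """Multiplier applied to the residual standard deviation when
    predicting a single new observation at x0 (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return math.sqrt(1.0 + 1.0 / n + (x0 - mean_x) ** 2 / sxx)


# Hypothetical observed inputs: evenly spaced values from 45 to 94.
observed = [float(x) for x in range(45, 95)]

for x0 in (70.0, 120.0, 200.0):
    print(f"x0 = {x0:5.1f}: standard error multiplier = "
          f"{prediction_se_factor(observed, x0):.3f}")
```

The multiplier grows as x0 moves away from the centre of the data, so any interval built on it widens. Note that this widening still assumes the straight-line relationship continues to hold outside the observed range, which is exactly the assumption that extrapolation cannot verify.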

However, I would advise avoiding extrapolation as much as possible. I would rather be highly confident in a mathematical model than have extrapolated estimates with lower confidence.

Conclusion

Both examples above use simple linear regression. It is worth noting that the same principles hold true for all types of regression techniques. The bottom line is: only predict values that the mathematical model supports. Never go beyond the range of values in your data set.

