Outlier
Adapted from Wikipedia
In statistics, an outlier is a data point that differs greatly from the other observations. An outlier might come from variability in the measurement, it might show something new, or it might be the result of a mistake in an experiment. Outliers caused by mistakes are sometimes left out of the data set.
Outliers can appear by chance in any group of data, but they can also point to new patterns, to mistakes in measurement, or to data whose distribution has heavy tails, meaning that extreme values are more common than usual. When the cause is a measurement mistake, people might want to remove the outliers or use ways of analyzing the data that are robust to outliers. When the cause is a heavy-tailed distribution, we need to be careful with tools that assume a normal distribution.
In most large groups of data, some points will be farther from the middle of the data than seems reasonable. This can be because of an accidental error in the measurements, a flaw in the theory used to describe the data, or simply because some observations are far from the rest. Such outlier points might show problems in the data, mistakes in how the data was collected, or places where a certain theory does not work well. However, in large groups, a small number of outliers is expected and does not point to anything unusual.
Outliers are the most extreme points, so they can include the highest value in the data, the lowest value, or both. But the highest and lowest points are not always outliers, because they might not be unusually far from the rest of the data. Statistics that are calculated from data containing outliers can give us the wrong idea if we use them without care. For example, suppose we want the average temperature of 10 objects in a room. Nine are between 20 and 25 °C, but an oven is at 175 °C. The median of the data will be between 20 and 25 °C, but the mean will be between 35.5 and 40 °C. In this case, the median shows better what the temperature of a typical object is, while the mean is misleading. This shows how important it is to use the right tools, like the median, which is a robust way to describe the middle of the data.
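The oven example can be tried out with a few lines of Python. This is only a small sketch: the nine room temperatures are made-up values between 20 and 25 °C, chosen for illustration.

```python
from statistics import mean, median

# Made-up temperatures in °C: nine ordinary objects and one hot oven.
temps = [20, 21, 21, 22, 22, 23, 23, 24, 25, 175]

temps_mean = mean(temps)      # pulled upward by the single oven value
temps_median = median(temps)  # stays among the nine ordinary objects

print("mean:", temps_mean)
print("median:", temps_median)
```

With these values the mean lands well above every ordinary object, while the median stays between 20 and 25 °C.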
Occurrence and causes
In statistics, an outlier is a data point that is very different from the others. For data that follows a normal distribution, about 1 in 22 points will be more than two standard deviations from the mean, and about 1 in 370 will be more than three standard deviations away. (The standard deviation measures how far points typically lie from the mean.) In a group of 1,000 points, finding up to five points more than three standard deviations from the mean is within what can be expected. But in a smaller group of 100 points, even three such points are a reason to look closer, because that is more than 11 times the expected number.
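The "1 in 22" and "1 in 370" figures can be checked with the standard formula for the tails of a normal distribution. The sketch below assumes perfectly normal data; the function name fraction_beyond is made up for this example.

```python
import math

def fraction_beyond(k):
    """Fraction of a normal distribution lying more than k standard
    deviations from the mean, counting both sides together."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3):
    p = fraction_beyond(k)
    print(f"more than {k} standard deviations away: about 1 in {1 / p:.0f}")

# Expected number of points more than 3 standard deviations from the mean.
expected_in_1000 = 1000 * fraction_beyond(3)
expected_in_100 = 100 * fraction_beyond(3)
print(f"expected in 1,000 points: {expected_in_1000:.1f}")
print(f"expected in 100 points: {expected_in_100:.2f}")
```

About 2.7 such points are expected in 1,000, so five is not surprising. Only about 0.27 are expected in 100, so three is far more than chance alone would usually give.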
Outliers can happen for many reasons. Sometimes, the tool used to collect the data has a temporary problem. There might be an error when the data is written down or sent. Outliers can also come from changes in what is being measured, mistakes made by people, or natural differences in the group being studied. Sometimes, an outlier shows that the theory we assumed for the data needs to be checked more carefully.
Definitions and detection
There is no strict mathematical rule that says what makes something an outlier. It depends on whether a piece of data stands out a lot from the others, and in the end this is a subjective decision.
There are many ways to find outliers. Some are graphical, such as normal probability plots, while others are based on a statistical model. Box plots mix both.
Model-based methods often assume that the data follows a normal distribution. They use the mean and standard deviation to pick out points that would be unlikely under that assumption. Some of these methods include:
- Chauvenet's criterion
- Grubbs's test for outliers
- Dixon's Q test
- ASTM E178: Standard Practice for Dealing With Outlying Observations
- Mahalanobis distance and leverage
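To give a feel for how the mean and standard deviation can be used, here is a very simple sketch that flags points lying more than a chosen number of standard deviations from the mean. It is not an implementation of any of the named tests above. The function name flag_by_z_score, the threshold of 2, and the temperature values are all assumptions made for this example.

```python
from statistics import mean, stdev

def flag_by_z_score(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations from the mean. The threshold is an illustrative choice."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]

# Made-up temperatures in °C: nine ordinary objects and one hot oven.
temps = [20, 21, 21, 22, 22, 23, 23, 24, 25, 175]
flagged = flag_by_z_score(temps)
print(flagged)
```

One weakness of this simple rule is that an outlier pulls the mean and the standard deviation toward itself, so in small data sets it can hide from the rule. The formal tests listed above are designed with more care.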
Another way uses the interquartile range. If Q1 is the lower quartile and Q3 is the upper quartile of the data, then any point outside the range from Q1 − k(Q3 − Q1) to Q3 + k(Q3 − Q1) can be called an outlier, where k is a chosen non-negative number. John Tukey suggested this test: with k = 1.5 a point outside the range is an "outlier", and with k = 3 it is "far out".
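Tukey's rule can be sketched in Python as follows. The function names and the readings are made up for this example. Quartiles can be calculated in several slightly different ways, and this sketch simply uses the default method of Python's statistics.quantiles, so other tools may give slightly different fences.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # lower quartile, median, upper quartile
    iqr = q3 - q1                       # interquartile range
    return q1 - k * iqr, q3 + k * iqr

def outside_fences(values, k=1.5):
    """Return the values lying outside the fences for the given k."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]

# Made-up readings: most are close together, one is a bit high, one is very high.
readings = [20, 21, 21, 22, 22, 23, 23, 24, 25, 32, 175]

outliers = outside_fences(readings, k=1.5)  # "outliers"
far_out = outside_fences(readings, k=3)     # "far out" points
print(outliers)
print(far_out)
```

Here the reading of 32 is outside the fences for k = 1.5 but inside them for k = 3, while 175 is outside both.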
Outliers can show us something new or unusual, but they can also mess up our math when we study groups of data.
Working with outliers
When we find a data point that is very different from the others, we need to think about why it is different. Sometimes, the difference comes from a mistake in how the data was collected. In those cases, it might be best to fix the mistake if we can, or else to remove the data point.
However, removing a data point only because it looks different is not a good idea. This can change the results of our analysis in ways we might not expect. Instead, we should use robust methods, which can handle such data points without removing them.
Sometimes, the data we are looking at does not follow a normal distribution, which means we might see more unusual data points than we expect. In these cases, we need to use methods that still work even with these unusual points.
If we decide to remove a data point, it is important to say clearly in any report that we did so and why. There are also other ways to handle unusual data points, such as replacing them with the nearest values that are not unusual.
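Replacing extreme points with nearby values is often called winsorizing. The sketch below shows one simple version, in which a chosen number of the smallest and largest values are replaced by their nearest remaining neighbours. The function name winsorize, the count parameter, and the temperatures are assumptions made for this example.

```python
def winsorize(values, count=1):
    """Replace the `count` smallest and `count` largest values with the
    nearest remaining values. The original order is kept."""
    if count < 0 or 2 * count >= len(values):
        raise ValueError("count must leave at least one value unchanged")
    ordered = sorted(values)
    low = ordered[count]        # smallest value that is kept as it is
    high = ordered[-count - 1]  # largest value that is kept as it is
    return [min(max(v, low), high) for v in values]

# Made-up temperatures in °C: nine ordinary objects and one hot oven.
temps = [20, 21, 21, 22, 22, 23, 23, 24, 25, 175]
softened = winsorize(temps, count=1)
print(softened)
```

In this sketch the oven's 175 becomes 25 and the lowest value, 20, becomes 21. Changes like this should be reported just as clearly as removed points.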
This article is a child-friendly adaptation of the Wikipedia article on Outlier, available under CC BY-SA 4.0.