Detecting anomalies is a crucial aspect of any company’s operations, achievable through various methods, including outlier analysis. In the realm of statistics, outliers can dramatically skew results, particularly when dealing with small sample sizes, thereby affecting averages and influencing the conclusions drawn. Understanding and managing these outliers is an essential step in data analysis.
Definition: Data outliers
Data outliers are values that differ significantly from the rest of the dataset. They represent information on out-of-the-ordinary behavior, such as numbers far from the norm for the variable in question.
Depending on the overall context of the data, outliers can signify an error in data collection or measurement, or they can represent interesting anomalies such as rare occurrences or extreme values.
According to Roshandel (2022), data outliers are problematic in statistics because they can skew summary statistics and lead to misleading conclusions.
Different types of data outliers
There are various kinds of data outliers, but broadly speaking, they can be categorized into three types: global, contextual, and collective outliers.
Global outliers
These are the most basic kind of data outlier: individual data points that differ significantly from every other point in the dataset.
These types of data outliers can occur due to measurement errors or represent extreme values in the population. Existing outlier detection methods often focus on finding global outliers.
Contextual outliers
These data points are not outliers when considered in isolation but become outliers once contextual factors, such as time or location, are taken into account.
Without relevant background information, contextual outliers can be challenging to identify. For this reason, it’s essential to have a description of the context at hand while looking for them.
Collective outliers
These are groups of data points that, taken together, differ significantly from the other groups in the dataset. Collective outliers can occur when there are subgroups or clusters of data with different characteristics, or when there are errors in the grouping or categorization of data.
True and false data outliers
Data outliers can occur due to different reasons, and it’s important to distinguish between true outliers resulting from natural variation in a population and outliers arising from measurement errors or incorrect data entry.
Outliers resulting from natural variation in the population are referred to as true/real outliers. These data outliers can provide valuable insights into the data distribution and the population’s characteristics under study.
In contrast, data outliers arising from measurement errors or incorrect data entry are known as false/spurious outliers. These data outliers can distort the distribution of the data and can result in incorrect conclusions or analyses.
When dealing with many outliers or a skewed distribution, it’s essential to consider what caused them.
If the outliers are due to measurement errors or incorrect data entry, they should be corrected or removed from the dataset.
However, if the outliers are due to natural variation in the population, they should be retained in the dataset.
Finding data outliers
Detecting data outliers is essential in data analysis as it ensures data quality, statistical significance, and accurate modeling. There are several techniques for finding data outliers, each of which takes a slightly different approach.
Sorting your data
Sorting your data is an easy way to identify outliers because it allows you to see the extreme values at a glance. You can sort your data in ascending or descending order, depending on the variable of interest, and then check for abnormally high or low values.
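As an illustration, the sales figures below are hypothetical; sorting pushes the extreme value to the end of the list, where it stands out at a glance:

```python
# Hypothetical monthly sales figures; one entry is clearly extreme.
sales = [210, 195, 205, 199, 1250, 202, 208]

# Sorting pushes extreme values to the ends of the list.
ordered = sorted(sales)
print(ordered[:3])   # -> [195, 199, 202]
print(ordered[-3:])  # -> [208, 210, 1250] -- 1250 is the obvious outlier
```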
Graphing your data
Graphs like scatter plots and histograms can also be used to identify data outliers. Graphs provide a visual representation of your data, making it simple to highlight patterns and spot outliers.
Outliers will manifest as points or bars that deviate notably from the rest of the dataset. A scatter plot displays data points as dots on an x-y graph, with each point positioned by two variables.
Since most points tend to cluster together, scatter plots make it easy to spot outliers: any point far from the main cluster is a candidate.
In contrast, histograms use bars to organize data into ranges. The data ranges (bins) run along the x-axis, while the frequency of values in each bin runs along the y-axis. For example, if most data points fall into bins on the right-hand side of the graph but one bin sits on the extreme left, that left bin stands out as an anomaly.
Using z-scores
A z-score represents the number of standard deviations a data point lies from the mean of a dataset. Z-scores are obtained by taking a data point, subtracting the mean, and then dividing by the standard deviation.
The formula for calculating a z-score is:
z = (x – μ) / σ
- z = z-score
- x = data point
- μ = mean of the dataset
- σ = standard deviation of the dataset
If the absolute value of a data point’s z-score is large, that point is considered an outlier; a common rule of thumb is to flag points with |z| greater than 3, though thresholds of 2 or 2.5 are also used.
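A minimal sketch of z-score-based detection using Python’s standard library; the data and the |z| > 2 threshold are illustrative choices, not fixed rules:

```python
from statistics import mean, stdev

def z_scores(data):
    """Return each point's z-score: (x - mean) / standard deviation."""
    mu, sigma = mean(data), stdev(data)
    return [(x - mu) / sigma for x in data]

# Invented sample; 40 sits far from the rest of the values.
data = [10, 12, 11, 13, 12, 40]
threshold = 2  # illustrative cut-off; 2.5 or 3 are also common
outliers = [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]
print(outliers)  # -> [40]
```

Note that with small samples the outlier itself inflates the mean and standard deviation, which is why a threshold of 3 can fail to flag anything here.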
Using interquartile range
The interquartile range (IQR) is a measure of variability that tells how spread out the middle 50% of the dataset is. To identify data outliers using the IQR method, a general rule is to consider any data point that falls more than 1.5 times the IQR below the first quartile or above the third quartile as an outlier.
Q1 and Q3 denote the first and third quartiles of the dataset, and the IQR is the difference between them: IQR = Q3 – Q1.
The Tukey method, also known as the Tukey fence method, is a variation of the IQR method that uses fences to identify data outliers.
A fence is a threshold value beyond which any data point is considered an outlier. The fences are calculated by adding or subtracting a multiple of the IQR from the first and third quartiles. Any data point that falls outside of the fences is considered an outlier.
The fences are calculated as follows:
- Lower fence = Q1 – (1.5 x IQR)
- Upper fence = Q3 + (1.5 x IQR)
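The fence calculation above can be sketched in Python; the sample data is made up, and `statistics.quantiles` uses one of several quartile conventions, so exact fence values can differ slightly between tools:

```python
from statistics import quantiles

def tukey_fences(data, k=1.5):
    """Return the lower and upper Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4)  # quartiles; conventions vary by library
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented sample with one extreme value.
data = [4, 5, 5, 6, 6, 7, 7, 8, 30]
low, high = tukey_fences(data)
outliers = [x for x in data if x < low or x > high]
print(outliers)  # -> [30]
```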
Using hypothesis tests
Hypothesis tests such as Grubbs’ test and Peirce’s criterion are more advanced methods for identifying data outliers. They are typically used when the data is suspected to come from a normally distributed population.
These tests are useful when the dataset is large and the outliers are difficult to detect using simple methods like sorting or graphing. It’s crucial to exercise caution when using them, because they assume the data is normally distributed and may not be applicable to datasets with other distributions.
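As a sketch only, Grubbs’ test can be implemented with the standard Grubbs statistic and a t-distribution critical value (computed here with SciPy); the dataset is invented, and the test flags at most one outlier per pass:

```python
import math
from statistics import mean, stdev
from scipy import stats  # used only for the t-distribution quantile

def grubbs_test(data, alpha=0.05):
    """Two-sided Grubbs' test; flags at most one outlier per pass.

    Assumes the data come from an approximately normal population.
    Returns the suspect value if it is an outlier at level alpha, else None.
    """
    n = len(data)
    mu, s = mean(data), stdev(data)
    suspect = max(data, key=lambda x: abs(x - mu))
    g = abs(suspect - mu) / s
    # Standard Grubbs critical value built from a t-distribution quantile.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / math.sqrt(n)) * math.sqrt(
        t_crit**2 / (n - 2 + t_crit**2)
    )
    return suspect if g > g_crit else None

# Invented measurements; 19.5 is far from the cluster around 12.
print(grubbs_test([12.1, 12.3, 11.9, 12.0, 12.2, 12.1, 19.5]))  # -> 19.5
```

To find multiple outliers, the test is typically rerun on the remaining data after removing each flagged point.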
Removing data outliers
Detecting and understanding data outliers is essential in statistics for several reasons:
- They can significantly impact statistical analyses, as they can skew the results and make them less accurate.
- Data outliers can provide important information about the data-generating process, such as errors in measurement, data entry, or data collection.
- They may represent rare or extreme events of particular interest, such as anomalies in financial data.
It’s important to note that removing data outliers can also remove valuable information from the dataset. In some cases, data outliers represent values important to the analysis.
When to remove a data outlier, and when not to
Removing data outliers from a dataset can improve the accuracy of statistical analysis and prevent misleading results. However, whether to eliminate an outlier depends on whether it’s representative of the population you’re studying, as well as on your topic, research question, and approach.
If the outlier is:
- A data entry or measurement error, you can correct it. However, if the error cannot be fixed, removing it is the best solution.
- Not representative of the population under study (i.e., having exceptional traits or situations), it can be legitimately excluded.
- A natural part of the population you’re studying, you shouldn’t get rid of it.
How do you remove outliers from your dataset?
Some of the common methods for removing data outliers are:
Using standard deviation:
This method is suited for normally (Gaussian) distributed data. The upper and lower boundaries are set three standard deviations above and below the mean, and any data point outside them is removed.
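A minimal sketch of the three-standard-deviation rule, assuming roughly normal data; the dataset is fabricated for illustration:

```python
from statistics import mean, stdev

def remove_3sigma_outliers(data):
    """Keep only points within mean +/- 3 standard deviations."""
    mu, sigma = mean(data), stdev(data)
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return [x for x in data if low <= x <= high]

# Invented, roughly normal readings plus one extreme value.
data = [48] * 2 + [49] * 4 + [50] * 7 + [51] * 4 + [52] * 2 + [120]
cleaned = remove_3sigma_outliers(data)
print(120 in cleaned)  # -> False (120 is dropped; the rest are kept)
```

One caveat: the boundaries here are computed with the outlier still included, so a single extreme point in a very small sample can mask itself; robust variants recompute the statistics after excluding the suspect point.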
Interquartile range (IQR) method:
It involves calculating the IQR of a dataset and removing any data point that falls more than 1.5 x IQR below the first quartile or above the third quartile.
Visual inspection:
This method involves visually inspecting a plot of the data and identifying any points that are far from the others.
It’s important to note that these methods may not always be appropriate. Therefore, you should consider the context of the analysis and the study’s goals before retaining or removing data outliers.
Data outliers are not always bad, but they can indicate errors or unusual patterns in the data that may warrant further investigation.
Preventing outliers in data can be challenging, but some measures can be taken, such as:
- Improving data quality
- Using appropriate measurement techniques
- Setting realistic thresholds for data values
For skewed datasets, the IQR method is the best choice for removing outliers. If the data is normally distributed, as is often the case, the standard deviation method works best.