In practice, there are issues for the solution of which it is required to find anomalies in the numerical series. For ease of understanding, we can assume that these are values that differ from most numbers in the series in some way (outlier, non-standard value, deviation from the norm). Such tasks are found in various areas:
- cleaning of noisy data in Data Science;
- outlier filtering in the training sample for neural
networks in Machine Learning;
- search for abnormal network hacker activity, while
monitoring traffic and events in Cybersecurity;
- detection of outliers or tails in the stock data
stream in Algorithmic Trading;
- as well as in any anomaly search tasks, where data can be presented as a numerical series.
The concepts of a number series in mathematical analysis and in statistics are different. We accept a numerical series as its statistical understanding, that is, a finite sequence of numbers (analogous to a sample). There are various interpretations of the anomaly in the numerical series. We will consider them further.
The article also shows examples of how to find anomalies quickly and efficiently in numerical series using the modified Hampel method (Hampel F.R.).
The concept of the number series anomaly
For one-dimensional
data, outlier estimation is extensively researched in the statistical
literature. Traditionally, the most useful estimates for characterizing the
variance of data are the sample
mean and variance of the sample. They give a good estimate of the
location and scatter of the data in the sample, but only if the outliers
"do not pollute" it. Even if the data contains only one observation
that is significantly different from the rest in the sample, then the sample
mean can also deviate significantly from the sample mean without this outlier.
To measure the
robustness of a chosen statistical estimation method against outliers, Hampel
introduced the concept of a breakdown point. The breakdown point is the largest
percentage of "polluted data" (the proportion of incorrect
observations or outliers) that the chosen method can process before it starts
to produce an incorrect result. These can be arbitrarily large aberrant
(abnormal, erroneous) values.
Intuitively, one can
understand that the breakdown point cannot exceed 50%, because if more than
half of the observations are "polluted", then it is impossible to
distinguish the main distribution of the series from the polluting
distribution. Therefore, the maximum outlier fraction for the breakdown point
is 0.5. There are examples of statistics that reach the maximum value for the breakdown
point. For example, it is 0.5 for Median. And for the sample mean, it is equal to 1/n, where n
is the sample size.
It is generally
believed that the higher the breakdown point value, the more reliable the
statistical evaluation method. A statistic with a high breakdown point value is
sometimes called a robust (resistant) statistic.
To estimate more
reliably the location and scatter of sample elements, it is often recommended
to combine the Median and the Median
Absolute Deviation (MAD). It is they
that underlie the method of filtering the sample for outliers according to
Hampel [1].
The MAD value is
calculated like this:
MAD (X) =
Median ( | x1 − Median (X) | , …,
| xn − Median (X) | ),
(
1 )
where X is
a sample of n observations x1, …, xn.
According
to Hampel, a sample outlier is a number x from a sample X, the modulus
of the difference of which and the sample median is greater than the MAD of the
sample, calculated by formula (1) and multiplied by a special coefficient k (scale factor), depending on the distribution (≈1.4826
for normal) [2].
In
practice, to build a Hampel outlier filter, this definition is modified using sliding windows and the parameter s (sigma), a threshold number of standard deviations. This means that the median
and MAD are calculated for all sliding windows consisting of the elements of
the sample. Further, all modules of the difference between the elements of the
sample and the median are compared with MAD, multiplied by the coefficient k
and multiplied by the parameter s [3]. A higher standard deviation threshold s
makes the filter more sparing; a lower threshold identifies more numbers as
outliers.
Now let's
define the basic concept: what we mean by an anomaly in a finite number series
for practical issues.
Definition 1. The anomaly of a number series is a number x from the
series X, which modulo differs from the median of the sliding window by more
than s times from the MAD of this window, multiplied by the coefficient k.
To
generalize the anomaly filter as much as possible and abstract it from the practical issue being solved, we propose the following set-theoretic interpretation.
All
anomalies of the series form some of its subset A:
A = { a | |
a − Median (Xi) | > s ∙ k ∙ MAD (Xi) },
( 2 )
where a is
an anomalous element from X calculated according to Definition 1, X is the
initial number series, Xi is the number series of the
i-th sliding window, total windows: (n − w + 1), where w is the window size.
The anomalies filter of the number series is defined as a function:
F:
X → { True, False }. F (xi) = True, if xi ∈ A; False, if xi ∉ A. (
3 )
Here X is
the original number series consisting of n elements xi, A is
the set of anomalies (2).
Filtering the number series for anomalies by the Hampel method
To filter a number series for the presence of anomalies in it, it is proposed to use the Python implementation of the HampelFilter() function. This function allows you to detect anomaly among the values of any number series using the Hampel filtering method. The filter detects anomalies determined according to (2). Configurable parameters in this filter function: window (variable w from the formulas above) — size of the sliding window (5 by default), sigma (variable s) — threshold number of standard deviations (3 by default), scaleFactor (variable k) — special coefficient, depending on the distribution of the series (1.4826 by default).
The
HampelFilter() filtering function returns a new series F, according to
(3) consisting of True or False values, where True means the presence of an
anomalous element at the corresponding position in the original series.
Input examples and results, HampelFilter(window=5, sigma=3,
scaleFactor=1.4826)
Input data Function output
[10, 10,
10, 10, 10] [False, False, False, False, False]
[1, 10, 10, 10, 10] [True, False, False, False,
False]
[1, 5, 10, 10, 10] [True, True, False, False,
False]
[1, 5, 1, 1, 1] [False, True, False,
False, False]
Input examples and results, HampelFilter(window=3, sigma=3,
scaleFactor=1.4826)
Input data Function output
[1, 5, 1,
1, 1] [False, True, False, False,
False]
[1, 10, 10, 1, 10, 1] [True, False, False, False, False,
False]
[1, 10, 10, 10, 10, 1] [True, False, False, False, False,
True]
[1, 1, 1, 10, 10, 10] [False, False, False, False, False,
False]
Finding the index of the first anomalous element
In
practice, it is often necessary to determine only the first anomalous or the
first among the largest elements in the number series. To do this, you can
use the Python implementation of the HampelAnomalyDetection() function. This function returns the
minimum index of an element in the list of found anomalies, or the index of the
first maximum element in the input series if its index is less than the index
of the anomalous element. If there are no anomalies in the number series, or if
all elements are equal (no maximums), then None is returned.
Input examples and results, HampelAnomalyDetection() with default values
Input data Function output
[1, 1, 1,
1, 111, 1] 4
[1, 1, 10, 1, 1, 1] 2
[111, 1, 1, 1, 1, 1] 0
[111, 1, 1, 1, 1, 111] 0
[1, 11, 1, 111, 1, 1] 1
[1, 1, 1, 111, 99, 11] 3
[-111, 1, 1, 1, 1] 0
[1, 2, 1, -1, 1] 1
[1]
None
[1, 2]
None
[1, 1, 1, 1, 1, 1] None
Filtering stock prices for anomalies
You can
see the use of the above functions using the example of the task of finding
outliers in stock data. To do this, we generated time series with data similar
to random stock prices and anomalies in them using the PriceGenerator library, and then applied these functions
to the data. The functions themselves are in the TKSBrokerAPI library (it contains TradeRoutines module).
The solution to the issue was implemented on Jupyter Notebook: HampelFilteringExample.ipynb. After analysis by the Hampel method anomalies on the price chart can be marked, for example, as shown in the illustration below.
As can be
seen with the naked eye, there are clearly outliers in the series, tails at the
top and bottom of some candles. We are not interested in all such outliers, but
only in tails that are too long. For us, these are signs of a price anomaly.
The Hampel filter coped with the detection of anomalies in a series of 75
candles with a sliding window value equal to the series size, the
number of standard deviations equal to 3 (sigma parameter) and a constant
coefficient characterizing the normal distribution equal to 1.4826 (scale
factor parameter). Similarly, you can analyze for anomalies the size of the
bodies of the candles.
Conclusions
Of course,
Hampel's method also has its drawbacks. For example, when implementing the
algorithm, one may encounter the fact that the anomalies present in the first
and last elements of the series according to the original algorithm will be
ignored. This is due to the use of a sliding window. To workaround the issue,
we had to pre-expand the number series in front and behind by the size of this
window. Only after that it was possible to detect anomalies both in the first
and in the last elements of the initial series.
However,
in practice, the Hampel filter is extremely effective. Thanks to modern
libraries such as Pandas, it handles fairly large numerical series
of tens of thousands elements in a matter of seconds, and is also easy to
optimize and parallelize (using CUDA Python and Numba’s @cuda.jit decorator, or Python Multiprocessing ThreadPool).
Thus, if
you have an issue of finding anomalies in a number series, try looking towards
filtering them using the Hampel method. It gives fast results and is also easy
to understand and implement.
Authors:
Timur Gilmullin, Ph.D., Mansur Gilmullin, Ph.D.
Sources:
- Hampel F. R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 382–393, 1974.
- Hancong Liu, Sirish Shah and Wei Jiang. On-line outlier detection and
data cleaning. Computers and Chemical Engineering. Vol. 28, March 2004, pp.
1635–1647.
Link: https://sites.ualberta.ca/~slshah/files/on_line_outlier_det.pdf - Lewinson Eryk. Outlier Detection with Hampel Filter. September 26, 2019.
Link: https://towardsdatascience.com/outlier-detection-with-hampel-filter-85ddf523c73d
Source code of the modules used in examples:
- PriceGenerator — this module can generate time series with data like
random stock prices with anomalies.
Link: https://github.com/Tim55667757/PriceGenerator/blob/master/README.md - TKSBrokerAPI —
this module contains the TradeRoutines library with the necessary functions.
Link: https://github.com/Tim55667757/TKSBrokerAPI/blob/master/README_EN.md - TradeRoutines — this library has methods for analyzing number series
for anomalies.
Link: https://tim55667757.github.io/TKSBrokerAPI/docs/tksbrokerapi/TradeRoutines.html - TestAnomalyFilter — a simple script with an example of generating an OHLCV time series and detecting for anomaly using the Hampel method.
Link: https://github.com/Tim55667757/TKSBrokerAPI/blob/develop/docs/examples/TestAnomalyFilter.py - HampelFilteringExample.ipynb — Jupyter Notebook with an example of using
the Hampel filter for the issue of finding outliers in stock data.
Link: https://nbviewer.org/github/Tim55667757/TKSBrokerAPI/blob/develop/docs/examples/HampelFilteringExample_EN.ipynb